Abstract

This paper studies the instantaneous rate maximization and the weighted sum delay minimization problems
over a $K$-user multicast channel, where multiple antennas are available at the transmitter as well as
at all the receivers. Motivated by the degree of freedom optimality and the simplicity offered by linear 
precoding schemes, we consider the design of linear precoders using the aforementioned two criteria. 
We first consider the scenario wherein the linear precoder can be any complex-valued matrix subject to 
rank and power constraints. We propose cyclic alternating ascent based precoder design algorithms and 
establish their convergence to respective stationary points. Simulation results reveal that our proposed 
algorithms considerably outperform known competing solutions. We then consider a scenario in which 
the linear precoder can be formed by selecting and concatenating precoders from a given finite codebook 
of precoding matrices, subject to rank and power constraints. We show that under this scenario, the 
instantaneous rate maximization problem is equivalent to a robust submodular maximization problem 
which is strongly NP hard. We propose a deterministic approximation algorithm and show that it yields 
a bicriteria approximation. For the weighted sum delay minimization problem we propose a simple 
deterministic greedy algorithm, which at each step entails approximately maximizing a submodular set 
function subject to multiple knapsack constraints, and establish its performance guarantee.
I. Introduction

Next generation wireless networks will require a spectrally efficient physical layer multicasting scheme 
in order to cater to important emerging applications such as real-time video broadcast, wherein a common 
information needs to be simultaneously transmitted to multiple users. The design of spectrally efficient 
physical layer multicasting schemes via instantaneous rate maximization has consequently received 
significant recent attention. The seminal work of HI considers the design of the instantaneous rate 
maximizing transmit beamforming (a.k.a. rank-1 linear precoding) scheme for multicast and proves it to be 
an NP-hard problem. Efficient albeit sub-optimal designs of transmit beamforming (or equivalent rank-1 
transmission schemes) for multicast have thus been proposed in [H, lH. In addition, a hidden convexity 
of the multicast beamforming problem under certain channel conditions has been recently discovered 
in [14]. Another approach for designing beamforming vectors for multicast has been adopted in [12].
In particular, [12] assumes that users have been partitioned into non-overlapping user groups and then
proceeds to design beam vectors (one for each group) and their power levels. Several efficient heuristics
are suggested. This approach is further pursued in [11], where formation of groups is also considered and
transmissions pertaining to different groups are made orthogonal. Long-term beamforming for scenarios 
where instantaneous channel state is unavailable at the transmitter has been addressed in Q. On the 
other hand, the optimal (i.e., instantaneous rate maximizing) linear precoding based multicasting scheme 
without rank constraints can be obtained via convex optimization [6]. The scaling results derived in [6]
reveal that higher rank precoding is beneficial in the ubiquitous regime in which the number of users is 
larger than the number of transmit antennas. Indeed in this regime an open loop scheme with identity 
matrix precoder (whose size is equal to the number of transmit antennas) is asymptotically optimal. 

This paper intends to address the main issue with such higher rank precoding for multicast, which is 
the increase in the decoding complexity at each user, particularly when the rank exceeds the number of 
its receive antennas. In particular, we consider the problem of designing linear precoders for multicast 
subject to a given rank constraint, which allows us to address the trade off between spectral efficiency and 
decoding complexity. Compared to an existing recursive design based approach for constructing linear 
precoders for multicast lH (see also lH) which can also accommodate an input rank constraint, our 
approach introduces auxiliary variables to reformulate the optimization problem and uses an alternating 
optimization method [34] to achieve a Karush-Kuhn-Tucker (KKT) stationary point. We note that an
antenna subset selection scheme for multicast, which selects the optimal transmit antenna subset of a given 
size (assuming identity matrix precoder of that size), has been analyzed in fj\. Furthermore, alternating 
optimization based algorithms have been proposed for several multicast precoder design problems in
[10], all of which involve the decoding mean squared error. Here we consider the achievable rate instead,
which involves introducing more auxiliary variables in the alternating optimization, and the resulting proof
of convergence is also different.

In addition to the transmission rank constraint, in certain practical systems each user needs to be 
explicitly signaled about the choice of the precoder employed by the transmitter, thus necessitating the 
choice to lie in a finite codebook. Instead of considering an optimal albeit unstructured finite codebook 
design, we focus on a more practical setup entailing a lower memory footprint and signaling overhead, 
wherein a higher rank precoder is constructed by concatenating codewords from a given (base) codebook 
of precoding matrices. Under this scenario we show that the instantaneous rate maximization problem 
falls in the realm of robust submodular optimization [30] and is strongly NP-hard. We propose a
deterministic approximation algorithm and show that it yields a bicriteria approximation. 

Another precoder design metric of interest for physical layer multicasting is the weighted sum delay. 
The pertinent delay for each user is defined as the number of time intervals needed to accumulate 
enough information for decoding a common message; and the weight assigned to a user is determined 
by its priority in the multicasting system. Linear precoder design to minimize the weighted sum delay 
is considered under rank and power constraints as well as under a finite codebook-constraint, for which 
the alternating optimization and the submodularity, respectively, again become instrumental to develop 
efficient algorithms. We note that sum delay minimization over a discrete codebook has been recently 
considered in [15]. However, the innovative algorithms designed in [15] are based on an assumption
(which holds for strongly LOS channels) that each user can receive its data from only one beamforming 
vector in the codebook and that all other vectors are essentially in the null space of that user's channel, 
i.e. transmission along any such vector will result in a negligible received signal strength at the user. In 
contrast, we make no such assumption and indeed allow each user to accumulate its useful signal across 
several intervals (where one or more precoders are employed for transmission in each interval) till it 
meets a threshold for reliable decoding. 

The rest of the paper is organized as follows. Section II presents the system model and formulates
the two aforementioned precoder design problems. Efficient algorithms for maximizing the instantaneous
rate are developed in Section III, while Section IV switches to the weighted sum delay minimization
problem. The proposed algorithms are tested and compared numerically to other known approaches in
Section V, and the conclusions are presented in Section VI.

Notation: Upper (lower) boldface letters will be used for matrices (vectors); $(\cdot)^\dagger$ denotes the
complex-conjugate transposition; $\mathrm{Tr}(\cdot)$ the matrix trace; $\mathrm{rank}(\cdot)$ the matrix rank; $\mathbf{0}$ the all-zero matrix; $\|\cdot\|_F$ the
matrix Frobenius norm; and $|\cdot|$ the cardinality of a set as well as the determinant of a square matrix.

II. System Model and Problem Statement 

We consider a MIMO wireless physical layer multicasting system consisting of a base station (BS)
equipped with $M$ transmit antennas and $K$ users, where the $k$-th user is equipped with $N_k$ receive antennas
for $k = 1, \ldots, K$. All the $K$ users receive common information from the BS. We let $\mathbf{x}^\tau \in \mathbb{C}^{M}$ denote
the signal vector transmitted by the BS on slot $\tau \in \mathbb{Z}_+$, where a slot denotes a resource unit in the code,
frequency or time domain. Further, let $\mathbf{y}_k^\tau \in \mathbb{C}^{N_k}$ be the signal vector received by user $k = 1, \ldots, K$ on
slot $\tau$. Then, the input-output (I/O) relationship for the $k$-th user is modeled as

    $\mathbf{y}_k^\tau = \mathbf{H}_k^\tau \mathbf{x}^\tau + \mathbf{z}_k^\tau, \quad \forall k$    (1)

where $\mathbf{H}_k^\tau \in \mathbb{C}^{N_k \times M}$ is the channel matrix that models the channel seen by the $k$-th user from the
BS on slot $\tau$, and $\mathbf{z}_k^\tau \in \mathbb{C}^{N_k}$ is the additive complex Gaussian noise vector at the $k$-th user. The noise
vectors are assumed to be mutually independent (across slots) complex Gaussian vectors and without
loss of generality (Wlog) they are each assumed to be white, i.e., $\mathbf{z}_k^\tau \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$. This is possible
via a whitening filter which can be absorbed into the channel matrix $\mathbf{H}_k^\tau$. A power budget is imposed
on the transmitted signal as $E[\|\mathbf{x}^\tau\|^2] \le P$, $\forall \tau \in \mathbb{Z}_+$. It is further assumed that estimates of all the
channel matrices $\{\mathbf{H}_k^\tau\}$ in (1) are available at the BS, possibly by exploiting reciprocity or feedback. In
this paper for simplicity we assume that error free estimates are available to the BS. Nonetheless, the 
design methods presented in the sequel can be generalized to the scenario where only imperfect channel 
estimates are available. For example, one approach is to mimic the naive zero-forcing based precoding 
design for multiuser MIMO and let the BS design the precoders after assuming the channel estimates 
available to it to be perfect. Another more sophisticated approach is also possible by explicitly modeling 
the CSI errors; see for instance lUl, 191. 

Next, consider a simple communication scheme that uses linear transmit precoding at the BS. To this
end, suppose $d$ symbol streams are simultaneously transmitted by the BS on each slot and let $\mathbf{s}^\tau \in \mathbb{C}^{d}$
denote the coded and modulated symbol vector, with $\mathbf{W}^\tau \in \mathbb{C}^{M \times d}$ denoting the corresponding precoding
matrix. Thus, the transmitted signal at the BS becomes $\mathbf{x}^\tau = \mathbf{W}^\tau \mathbf{s}^\tau$, and the input-output relationship
per user $k$ is given by

    $\mathbf{y}_k^\tau = \mathbf{H}_k^\tau \mathbf{W}^\tau \mathbf{s}^\tau + \mathbf{z}_k^\tau, \quad \forall k.$    (2)

Wlog the encoded symbol vector is assumed to satisfy $E[\mathbf{s}^\tau \mathbf{s}^{\tau\dagger}] = \mathbf{I}$. Therefore, the achievable rate at
the $k$-th user for the scheme (2) can be expressed as

    $R_k^\tau(\mathbf{W}^\tau) = \log \left| \mathbf{I} + \mathbf{H}_k^\tau \mathbf{W}^\tau \mathbf{W}^{\tau\dagger} \mathbf{H}_k^{\tau\dagger} \right|, \quad \forall k.$    (3)



Then, given the multicast system (2) and some prescribed precoder codebook $\mathcal{C}$ (as detailed later), we
are interested in the problem of selecting the precoding matrix $\mathbf{W}^\tau \in \mathcal{C}$ under the following two goals.
The first design criterion is to achieve the best instantaneous throughput on each slot $\tau$, or equivalently
maximize the minimum of the rates $\{R_k^\tau\}$ among all the $K$ users. For simplicity, the slot index $\tau$ can
be omitted under this scenario, and the problem of interest becomes

    (P1)    $\max_{\mathbf{W} \in \mathcal{C}} \; \min_{k=1,\ldots,K} R_k(\mathbf{W}).$    (4)



Clearly, the precoder design problem (P1) focuses on the instantaneous throughput at each channel use
that is achievable for all users. In some circumstances, it is more meaningful to look at a weighted
average performance across all the $K$ users. Here, we consider a quasi-static fading scenario where in
each scheduling interval (defined over the time domain) the BS repeatedly transmits the same message
over $L$ orthogonal slots. The BS continues transmitting across successive scheduling intervals till at every
user the accumulated information exceeds some threshold $\Theta$. The threshold is chosen such that
enough information has been collected in order to reliably decode the transmitted message, for example
via rateless coding/decoding OTl Ch. 50]. Under this scenario, the incurred delay at the $k$-th user (in
terms of the number of scheduling intervals) to decode the transmitted message is given by

    $D_k(\{\mathbf{W}^\tau\}) := \min \left\{ t \in \mathbb{Z}_+ : \sum_{\tau=1}^{tL} R_k^\tau(\mathbf{W}^\tau) \ge \Theta \right\}.$    (5)

Note that in (5) we have assumed a quasi-static fading setup, where within the time horizon of interest
the channel per user $k$ remains invariant across all scheduling intervals, i.e., $\mathbf{H}_k^{tL+i} = \mathbf{H}_k^{i}$, $1 \le i \le L$
and $t \in \mathbb{Z}_+$. This assumption is reasonable for instance over a wideband orthogonal frequency division
multiplexing based multiple-access (OFDMA) system, where the users have low mobility. There each
scheduling interval comprises consecutive OFDM symbols and several such scheduling intervals are
within the coherence time. Each slot in a scheduling interval is formed by a set of consecutive sub-carriers 
and OFDM symbols, where the set of consecutive sub-carriers is well within the coherence bandwidth 
so that each slot can be represented by one channel matrix. Then, the goal is to jointly design a sequence 
of precoders $\{\mathbf{W}^\tau\}$ which together minimize the weighted sum delay among all the $K$ users; that is,

    (P2)    $\min_{\{\mathbf{W}^\tau \in \mathcal{C}\}} \; \sum_{k=1}^{K} \mu_k D_k(\{\mathbf{W}^\tau\})$    (6)

where the weights $\{\mu_k\}_{k=1}^{K}$ determine each user's priority. Furthermore, to specify the constraints on
the precoding matrices for both (P1) and (P2), two interesting codebook scenarios are introduced, as
explained below.

C1. Continuous codebook $\mathcal{C}_c$: allows the precoder $\mathbf{W}$ to be any arbitrary complex-valued matrix
subject to norm and dimensionality constraints. As a result of limited computational capability at
the users, the BS can afford to simultaneously transmit at most $d \ge 1$ symbol streams, where
we note that a larger $d$ increases the corresponding decoding complexity. Then, incorporating the
transmitter power constraint, the continuous codebook can be specified as

    $\mathcal{C}_c := \left\{ \mathbf{W} \in \mathbb{C}^{M \times d} : \|\mathbf{W}\|_F^2 \le P \right\}.$    (7)

Note that the continuous codebook is applicable in a scenario where over each slot of every
scheduling interval, pilots precoded by the chosen precoder can be transmitted so that each
user $k$ can directly estimate $\mathbf{H}_k^\tau \mathbf{W}^\tau$.
C2. Discrete codebook $\mathcal{C}_d$: Such a codebook is motivated by a practical scenario where precoded pilots
are not available and where the signaling overhead (needed to indicate the choice of precoder to the
users) is limited. In this case the BS can use a precoder that is formed by concatenating precoders
from a known base codebook $\mathcal{W}$ comprising a finite number of matrix codewords. It is assumed
that $\|\mathbf{W}'\|_F^2 = 1$, $\forall\, \mathbf{W}' \in \mathcal{W}$. Let $e = (\mathbf{W}', r, p)$ denote an element, where $\mathbf{W}' \in \mathcal{W}$, $r$ equals
the column dimension (and rank) of $\mathbf{W}'$ such that $\mathbf{W}' \in \mathbb{C}^{M \times r}$, and $p$ determines the power level
by which $\mathbf{W}'$ can be scaled. Further, let $\mathcal{E} = \{ e = (\mathbf{W}', r, p) : \mathbf{W}' \in \mathcal{W},\ r = \mathrm{rank}(\mathbf{W}') \in \mathbb{Z}_+ \}$
denote the ground set of all possible such elements, which is known to the BS (and to all users)
in advance. For any such element in $\mathcal{E}$ we adopt the convention that

    $e = (\mathbf{W}', r, p) \;\Rightarrow\; \mathbf{W}_e = \mathbf{W}';\ r_e = r;\ p_e = p.$    (8)

Thus, each precoder in $\mathcal{C}_d$ corresponds to some subset of elements $\mathcal{U} \subseteq \mathcal{E}$, as given by

    $\mathcal{C}_d := \left\{ \mathbf{W} = \left[ \{ \sqrt{p_e}\, \mathbf{W}_e \}_{e \in \mathcal{U}} \right] : \mathcal{U} \subseteq \mathcal{E},\ r_{\mathcal{U}} \le d,\ p_{\mathcal{U}} \le P \right\}$    (9)

where we follow the notational convention

    $r_{\mathcal{U}} = \sum_{e \in \mathcal{U}} r_e; \quad p_{\mathcal{U}} = \sum_{e \in \mathcal{U}} p_e.$    (10)

As the counterpart of the matrix dimension constraint in $\mathcal{C}_c$, the sum dimension constraint in (9) ensures
that at most $d$ streams are transmitted. In addition, the sum power constraint in (9) is akin to the
Frobenius norm constraint in (7). Note that the concatenation based approach of designing $\mathcal{C}_d$ has a
smaller memory footprint, facilitates simpler search algorithms for determining a suitable precoder
and can also reduce the signaling burden compared to a finite albeit unstructured codebook.
With these two codebook settings, the goal is to design the precoder matrix (matrices), bearing in mind
the aforementioned criteria in (P1) and (P2). The next section will address the first problem of maximizing
the instantaneous throughput. In what follows, we collect essential results that follow directly from known
results as lemmas (after proper citation) and collect the novel results in propositions.

III. Maximizing the Instantaneous Throughput 

This section focuses on the one-snapshot problem (P1), which maximizes the minimum among the
rates achievable at all the $K$ users for any given time instance. As mentioned earlier, in this whole section
the slot index $\tau$ can be dropped for simplicity.

A. Continuous Codebook 

Notice that the problem (PI) with the continuous codebook Cc is an NP-hard problem since the 
particular case with d = 1 is known to be NP-hard HI- Then, to efficiently obtain sub-optimal solutions, 
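it is useful to first consider a simple linear decoding scheme at each user. To this end, denote $\mathbf{G}_k \in \mathbb{C}^{N_k \times d}$
as the linear receive filter per user $k$. With the system model (2), the output of the $k$-th receive filter can
be expressed as

    $\hat{\mathbf{s}}_k = \mathbf{G}_k^\dagger \mathbf{y}_k = \mathbf{G}_k^\dagger \mathbf{H}_k \mathbf{W} \mathbf{s} + \mathbf{G}_k^\dagger \mathbf{z}_k, \quad \forall k,$    (11)

with the corresponding mean-squared error (MSE) matrix of estimating the signal $\mathbf{s}$ given by

    $\mathbf{E}_k(\mathbf{G}_k, \mathbf{W}) = E\left[ (\hat{\mathbf{s}}_k - \mathbf{s})(\hat{\mathbf{s}}_k - \mathbf{s})^\dagger \right] = (\mathbf{G}_k^\dagger \mathbf{H}_k \mathbf{W} - \mathbf{I}_d)(\mathbf{G}_k^\dagger \mathbf{H}_k \mathbf{W} - \mathbf{I}_d)^\dagger + \mathbf{G}_k^\dagger \mathbf{G}_k.$    (12)

Interestingly, the MSE matrix $\mathbf{E}_k(\mathbf{G}_k, \mathbf{W})$ in (12) can be related to the achievable rate $R_k(\mathbf{W})$ of (3),
as detailed in the following lemma (cf. [16]).

Lemma 1: For a given precoding matrix $\mathbf{W}$, the achievable rate $R_k(\mathbf{W})$ per user $k$ in (3) can be
obtained by solving the optimal receive filter problem as follows:

    $R_k(\mathbf{W}) = \max_{\mathbf{G}_k} \log \left| \mathbf{E}_k^{-1}(\mathbf{G}_k, \mathbf{W}) \right|$    (13)

where its optimum is attained at the linear minimum MSE (LMMSE) filter for the $k$-th user; that is,

    $\mathbf{G}_k = \left( \mathbf{H}_k \mathbf{W} \mathbf{W}^\dagger \mathbf{H}_k^\dagger + \mathbf{I}_{N_k} \right)^{-1} \mathbf{H}_k \mathbf{W}.$    (14)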


IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 






Unfortunately, the variables $\{\mathbf{G}_k\}$ and $\mathbf{W}$ together do not allow decomposing (4) into solvable
sub-problems. Consequently, we introduce more auxiliary variables which allow us to decompose (4) into
optimally solvable sub-problems. Towards that end, we state the following lemma which was proposed
and used to design precoders over the MIMO broadcast channel (with unicast transmissions) in [16] and
later for the MIMO interference channel in jSl . 

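Lemma 2: For any given precoder $\mathbf{W} \in \mathbb{C}^{M \times d}$ and any filter $\mathbf{G}_k \in \mathbb{C}^{N_k \times d}$, the MSE matrix $\mathbf{E}_k(\mathbf{G}_k, \mathbf{W})$
is positive definite and the following holds

    $\max_{\mathbf{S}_k \succ 0} \left\{ -\mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k) + \log|\mathbf{S}_k| + d \right\} = \log \left| \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})^{-1} \right|,$    (15)

where the optimum is attained at $\mathbf{S}_k = \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})^{-1}$.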
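For intuition, the identity in (15) can be seen from a standard first-order argument, sketched here with $\mathbf{E}_k$ short for $\mathbf{E}_k(\mathbf{G}_k, \mathbf{W})$. The objective is concave in $\mathbf{S}_k \succ 0$, so a point with vanishing gradient is a global maximizer. This is only an outline under the usual matrix-calculus conventions, not the proof given in the cited references.

```latex
\begin{align*}
f(\mathbf{S}_k) &= -\mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k) + \log|\mathbf{S}_k| + d, \\
\nabla_{\mathbf{S}_k} f &= -\mathbf{E}_k + \mathbf{S}_k^{-1} = \mathbf{0}
  \;\Longrightarrow\; \mathbf{S}_k = \mathbf{E}_k^{-1}, \\
f(\mathbf{E}_k^{-1}) &= -\mathrm{Tr}(\mathbf{I}_d) + \log|\mathbf{E}_k^{-1}| + d
  = \log|\mathbf{E}_k^{-1}|.
\end{align*}
```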

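It can be verified that for a given precoder $\mathbf{W}$ and any given $\mathbf{S}_k \succ 0$, the solution to $\min_{\mathbf{G}_k} \mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W}))$
is also achieved at (14). Then, to make the problem decomposable, using Lemma 2 introduce the (matrix)
slack variables $\{\mathbf{S}_k \in \mathbb{C}^{d \times d}\}_{k=1}^{K}$, one per user $k$. With the equivalence asserted in Lemmas 1 and 2 and
using the continuous codebook $\mathcal{C}_c$ in (7), the instantaneous throughput maximization problem (P1) can
be reformulated as

    $\max_{\|\mathbf{W}\|_F^2 \le P,\ \{\mathbf{G}_k,\, \mathbf{S}_k \succ 0\}} \; \min_{k=1,\ldots,K} \left\{ -\mathrm{Tr}[\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})] + \log|\mathbf{S}_k| + d \right\}$    (16)

where the MSE matrix $\mathbf{E}_k(\mathbf{G}_k, \mathbf{W})$ is given by (12).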

Interestingly, not only is the reformulated problem (16) equivalent to (P1), each stationary point of
(16) also yields a stationary point of (P1). The latter fact follows upon invoking the gradient expressions
given in [16] and is shown in the sequel. Further, the reformulated problem (16) also allows us to use a
cyclic alternating ascent (CAA) algorithm to decompose it into sub-problems that are solvable. For a
fixed $\mathbf{W}$, the problem in (16) can be optimally solved over $\{\mathbf{G}_k, \mathbf{S}_k\}$. This is because upon further
fixing $\{\mathbf{S}_k \succ 0\}$, the problem in (16) reduces to that of minimizing the weighted MSE cost over the linear
filter $\mathbf{G}_k$ per user $k$, with the closed-form solution given by (14); then using those $\{\mathbf{G}_k\}$, it reduces to
the problem in Lemma 2, which admits the closed-form solution $\mathbf{S}_k = (\mathbf{E}_k(\mathbf{G}_k, \mathbf{W}))^{-1} = \mathbf{W}^\dagger \mathbf{H}_k^\dagger \mathbf{H}_k \mathbf{W} + \mathbf{I}_d$. A
slightly more complicated sub-problem appears when solving for the precoder $\mathbf{W}$ while fixing both $\{\mathbf{G}_k\}$
and $\{\mathbf{S}_k\}$. To tackle this sub-problem, consider its equivalent form given by

    $\max_{\|\mathbf{W}\|_F^2 \le P,\ \beta} \; \beta$    (17a)

    s.to  $-\mathrm{Tr}[\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})] + \log|\mathbf{S}_k| + d \ge \beta, \quad \forall k$    (17b)

where at the optimum of (17), $\beta$ becomes equal to the minimum of the achievable costs among all $K$
users. Furthermore, define the following Cholesky factorization per user $k$ as $\mathbf{S}_k = \mathbf{B}_k \mathbf{B}_k^\dagger$, and thus the
constraint (17b) can be cast as a quadratic cone one in terms of the variable $\mathbf{W}$, and the sub-problem
for solving it becomes

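    $\max_{\|\mathbf{W}\|_F^2 \le P,\ \beta} \; \beta$    (18a)

    s.to  $c_k - \beta \ge \left\| \mathbf{B}_k^\dagger \left( \mathbf{G}_k^\dagger \mathbf{H}_k \mathbf{W} - \mathbf{I}_d \right) \right\|_F^2, \quad \forall k$    (18b)

where the constant $c_k := \log|\mathbf{S}_k| + d - \|\mathbf{G}_k \mathbf{B}_k\|_F^2$. Notice that the other power constraint on $\mathbf{W}$ is a
quadratic one, hence, the sub-problem (18) for obtaining $\mathbf{W}$ while fixing the others is a second-order
cone program (SOCP), and thus can be solved efficiently using some off-the-shelf optimization tools,
e.g., the interior point optimization routine in SeDuMi [33].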
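The step from (17b) to (18b) rests on splitting the weighted MSE $\mathrm{Tr}[\mathbf{S}_k \mathbf{E}_k]$ into a term that depends on $\mathbf{W}$ and a constant. The NumPy sketch below checks that split numerically for one user with randomly drawn matrices; all sizes and variable names are illustrative assumptions, and the filter and slack matrix are arbitrary rather than optimal.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, d = 4, 3, 2  # illustrative sizes (assumptions)

def crandn(*shape):
    """Complex matrix with i.i.d. circularly symmetric Gaussian entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H, W, G = crandn(N, M), crandn(M, d), crandn(N, d)
X = crandn(d, d)
S = X @ X.conj().T + np.eye(d)   # an arbitrary Hermitian positive definite slack matrix
B = np.linalg.cholesky(S)        # Cholesky factor, S = B B^†

A = G.conj().T @ H @ W - np.eye(d)
E = A @ A.conj().T + G.conj().T @ G          # MSE matrix (12)

weighted_mse = np.trace(S @ E).real          # Tr[S E], left side of the split
cone_term = np.linalg.norm(B.conj().T @ A, "fro") ** 2   # depends on W
const_term = np.linalg.norm(G @ B, "fro") ** 2           # independent of W
c_k = np.linalg.slogdet(S)[1] + d - const_term           # constant in (18b)
```

With these quantities, `weighted_mse` equals `cone_term + const_term` up to rounding, so the left side of (17b) can be written as `c_k - cone_term`.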

These aforementioned sub-problems suggest an iterative CAA algorithm yielding successive estimates
of one of the two groups of variables, $\{\mathbf{G}_k, \mathbf{S}_k\}$ and $\mathbf{W}$, with the remaining group fixed, as tabulated
in Algorithm 1. The convergence of Algorithm 1 in terms of the objective value is guaranteed due to
the cyclic ascent nature of the algorithm that ensures a monotonically non-decreasing objective across
iterations. However, proving the convergence for the sequence of iterates is more involved. The following
convergence claim applies for Algorithm 1 when it is invoked without any limit on the number of
iterations. A similar CAA convergence result is outlined in [13], but for a different problem setup involving
MIMO interference channels.



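Algorithm 1: (P1) with $\mathcal{C}_c$. Input the channel matrices $\{\mathbf{H}_k\}_{k=1}^{K}$ and an initial feasible $\mathbf{W}$. Output the
iterates upon convergence.
1: while the iterates have not converged and the maximum number of iterations has not been reached do
2:   for $k = 1, \ldots, K$ do
3:     Obtain the LMMSE optimal receive filter as $\mathbf{G}_k \leftarrow \left( \mathbf{H}_k \mathbf{W} \mathbf{W}^\dagger \mathbf{H}_k^\dagger + \mathbf{I}_{N_k} \right)^{-1} \mathbf{H}_k \mathbf{W}$.
4:     Update the slack matrix $\mathbf{S}_k \leftarrow \mathbf{W}^\dagger \mathbf{H}_k^\dagger \mathbf{H}_k \mathbf{W} + \mathbf{I}_d$, which is the inverse of the MSE matrix calculated via (12).
5:   end for
6:   Obtain the precoder matrix $\mathbf{W}$ by solving the SOCP problem (18).
7: end while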



Proposition 1: Either the sequence of iterates generated by Algorithm 1 converges to a stationary point
or each of its accumulation points is a stationary point of (P1), and the objective is non-decreasing as
the iterations proceed.

Proof: The proof is given in Appendix A.
B. Discrete Codebook

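From the definition of $\mathcal{C}_d$ in (9), each valid precoder corresponds to a subset $\mathcal{U} \subseteq \mathcal{E}$. Thus, the
achievable rate for user $k$ in (3) can be considered as a set function $R_k : 2^{\mathcal{E}} \to \mathbb{R}_+$ given by

    $R_k(\mathcal{U}) = \log \left| \mathbf{I} + \sum_{e \in \mathcal{U}} p_e \mathbf{H}_k \mathbf{W}_e \mathbf{W}_e^\dagger \mathbf{H}_k^\dagger \right|, \quad \forall k$    (19)

for all $\mathcal{U} \subseteq \mathcal{E}$. We offer the following useful result.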

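Proposition 2: The set function $R_k(\cdot)$ in (19) is a submodular set function, i.e.,

    $R_k(\mathcal{U} \cup \{e\}) - R_k(\mathcal{U}) \ge R_k(\mathcal{U}' \cup \{e\}) - R_k(\mathcal{U}'), \quad \forall k,$    (20)

for all $\mathcal{U} \subseteq \mathcal{U}' \subseteq \mathcal{E}$ and $e \in \mathcal{E}$. Further, it is also monotonic, i.e., $R_k(\mathcal{U}) \le R_k(\mathcal{U}')$, $\forall\, \mathcal{U} \subseteq \mathcal{U}'$, and
normalized, i.e., $R_k(\emptyset) = 0$, where $\emptyset$ denotes the empty set.
Proof: The proof is given in Appendix B.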
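The properties claimed in Proposition 2 can be probed numerically on a small instance by enumerating all nested subset pairs. The sketch below does this for one user; the ground set of five rank-1 unit-norm codewords, the power levels and the helper names are hypothetical choices made only for illustration, and a passing check on one instance is of course not a proof.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 2  # illustrative sizes (assumptions)
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)

# Hypothetical ground set: (codeword W_e with unit Frobenius norm, power level p_e).
ground = []
for _ in range(5):
    w = rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))
    ground.append((w / np.linalg.norm(w), rng.uniform(0.5, 2.0)))

def rate(subset):
    """Set function (19): log|I + sum_{e in U} p_e H W_e W_e^† H^†| in nats."""
    Q = np.eye(N, dtype=complex)
    for e in subset:
        We, pe = ground[e]
        HW = H @ We
        Q += pe * (HW @ HW.conj().T)
    return np.linalg.slogdet(Q)[1]

def check_properties(tol=1e-9):
    """Brute-force check of monotonicity and the diminishing-returns inequality (20)."""
    idx = range(len(ground))
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in itertools.combinations(idx, r)]
    for U in subsets:
        for V in subsets:
            if not U <= V:
                continue
            if rate(U) > rate(V) + tol:  # monotonicity violated
                return False
            for e in idx:
                if e in V:
                    continue
                gain_small = rate(U | {e}) - rate(U)
                gain_large = rate(V | {e}) - rate(V)
                if gain_small < gain_large - tol:  # (20) violated
                    return False
    return True
```

Calling `check_properties()` returns `True` when no violation is found on this instance, and `rate(frozenset())` evaluates the normalization $R_k(\emptyset)$.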

Thus, (P1) with the discrete codebook $\mathcal{C}_d$ in (9) becomes a robust submodular function maximization
problem, given by

    $\max_{\mathcal{U} \subseteq \mathcal{E}} \; \min_{k=1,\ldots,K} R_k(\mathcal{U}) \quad \text{s.to} \quad r_{\mathcal{U}} \le d,\ p_{\mathcal{U}} \le P.$    (21)

For general submodular set functions, maximizing a robust criterion with even one constraint has been
shown to be strongly NP-hard [30]. Here, we show that for the particular submodular set functions given
in (19), the robust rate maximization problem in (21) with only the power constraint, i.e., the problem

    $\max_{\mathcal{U} \subseteq \mathcal{E}} \; \min_{k=1,\ldots,K} R_k(\mathcal{U}) \quad \text{s.to} \quad p_{\mathcal{U}} \le P,$    (22)

is also strongly NP-hard, as asserted in Proposition 3. Note that an instance of the problem in (22)
comprises the number of users $K$ along with their channel matrices $\{\mathbf{H}_k\}_{k=1}^{K}$, the set $\mathcal{E}$ (specified via
a base codebook $\mathcal{W}$ of precoders and a power level for each precoder in $\mathcal{W}$) as well as the power budget
$P$. In particular, we show that (22) is NP-hard even over instances where we restrict $K = O(|\mathcal{E}|^{A})$ for
any arbitrarily fixed positive integer $A \ge 2$.

Proposition 3: Unless P = NP, there cannot exist any polynomial time approximation algorithm for (22).
More precisely: if there exists a positive function $\gamma : \mathbb{Z}_+ \to \mathbb{R}_+$ and an algorithm that, for all $|\mathcal{E}|$ and
$P$, in time polynomial in $|\mathcal{E}|$, is guaranteed to find a subset $\mathcal{U}'$ satisfying the power constraint $P$ such
that

    $\min_k R_k(\mathcal{U}') \ge \gamma(|\mathcal{E}|) \max_{\mathcal{U}:\, p_{\mathcal{U}} \le P} \min_k R_k(\mathcal{U}),$    (23)

then P = NP.

Proof: The proof is given in Appendix C.
Proposition 3 manifests the hardness of the discrete precoder design problem: polynomial-complexity
algorithms cannot approximate the optimal rate within a bound that is determined only by
$|\mathcal{E}|$. In the following, we consider the problem (22) and adopt a bicriterion optimization approach. We
leverage the Submodular Saturation algorithm (SSA) developed in [30], which considers the general
robust submodular maximization problem but can offer guarantees only for integral valued submodular
functions. Since the submodular functions that are of interest to us are not integral valued, we modify the
SSA by using recent results for the submodular set-cover problem, wherein the submodular cost function
can be real-valued [28].

Following the SSA, the proposed algorithm exploits the idea of the bisection method, which is applied
to the following equivalent formulation of (22):

    $\{c^\star, \mathcal{U}^\star\} := \arg\max_{c,\ \mathcal{U} \subseteq \mathcal{E}} \; c, \quad \text{s.to} \quad R_k(\mathcal{U}) \ge c,\ \forall k \ \text{and} \ p_{\mathcal{U}} \le P.$    (24)

The equivalence between (24) and (22) holds, since at the optimum of (24), the value $c^\star$ will always be
equal to the minimum of $\{R_k(\mathcal{U}^\star)\}$ across all the $K$ users. Now suppose that there exists an algorithm
that, for any given value $c$, solves the following optimization problem

    $\hat{\mathcal{U}} := \arg\min_{\mathcal{U} \subseteq \mathcal{E}} \; p_{\mathcal{U}}, \quad \text{s.to} \quad R_k(\mathcal{U}) \ge c,\ k = 1, \ldots, K,$    (25)

then the power associated with the optimum set $\hat{\mathcal{U}}$ can be used to decide the relationship between the
prescribed value $c$ and the optimum $c^\star$ in (24). Specifically, if it turns out that $p_{\hat{\mathcal{U}}} \le P$, then $c$ is feasible
for (24) and it must hold that $c \le c^\star$. Otherwise, the chosen value $c$ is infeasible for (24) and we have
$c > c^\star$. Hence, an iterative binary search on $c$ would then allow us to find the maximum value that is
feasible for (24). However, the problem (25) is not exactly solvable, but can only be approximated as
shown below.

To illustrate this, consider any feasible value $c$ and the truncated function $R_{k,c}(\mathcal{U}) := \min\{R_k(\mathcal{U}), c\}$.
Let $\bar{R}_c(\mathcal{U}) := (1/K) \sum_{k=1}^{K} R_{k,c}(\mathcal{U})$ be their average function, which is also submodular and monotonic
(this follows from a result in [30]). With these definitions, we have $\bar{R}_c(\mathcal{E}) = c$, and the constraint in (25)
holds if and only if $\bar{R}_c(\mathcal{U}) = c$, which establishes the equivalence between (25) and the following problem

    $\hat{\mathcal{U}} := \arg\min_{\mathcal{U} \subseteq \mathcal{E}} \; p_{\mathcal{U}}, \quad \text{s.to} \quad \bar{R}_c(\mathcal{U}) = \bar{R}_c(\mathcal{E}).$    (26)

Interestingly, the reformulated problem (26) is an instance of the submodular covering problems. A greedy
algorithm has been proposed in [35] to approximately solve such problems, but that algorithm yields a
useful guarantee only for integral valued submodular functions. Recall that the submodular functions
$\{R_k(\cdot)\}$ in (19) are not integral-valued. Consequently, we employ a variation of the greedy algorithm
proposed in [28] and given here in Algorithm 2.
Algorithm 2: (26) with a feasible $c$. Input the channel matrices $\{\mathbf{H}_k\}$, $\delta \in (0, 1)$ and the ground set $\mathcal{E}$.
Output the greedy solution $\mathcal{U}_G$ to (26).
  Initialize $\mathcal{U}_G = \emptyset$.
  while $\bar{R}_c(\mathcal{U}_G) < c(1 - \delta)$ do
    Update $\mathcal{U}_G \leftarrow \mathcal{U}_G \cup \left\{ \arg\max_{e \in \mathcal{E} \setminus \mathcal{U}_G} \dfrac{\bar{R}_c(\mathcal{U}_G \cup \{e\}) - \bar{R}_c(\mathcal{U}_G)}{p_e} \right\}$.
  end while



The following lemma follows from Theorem 1 of [28] when the latter is invoked using the submodular
set function $\bar{R}_c(\cdot)$, threshold $c$ (which we note is feasible for $\bar{R}_c(\cdot)$, i.e., $\bar{R}_c(\mathcal{E}) \ge c$) and a gap $c\delta$, where
$\delta \in (0, 1)$.

Lemma 3: With a monotonic real-valued submodular function $\bar{R}_c(\cdot)$, any $\delta \in (0, 1)$ and a (feasible)
value $c$, Algorithm 2 finds a set $\mathcal{U}_G$ such that $\bar{R}_c(\mathcal{U}_G) \ge c(1 - \delta)$ and $p_{\mathcal{U}_G} \le p_{\hat{\mathcal{U}}}\,(1 + \ln(1/\delta))$, where
$\hat{\mathcal{U}}$ is an optimal solution to (26).
Note that the greedy Algorithm 2 can only approximate the optimal solution $\hat{\mathcal{U}}$. This prevents us from
implementing the bisection method based on the equivalence between (22) and (24), since solving the
latter requires finding the exact optimal solution to (25) in each bisection iteration for any given $c$.
Therefore, we need to adapt the original binary search procedure in order to accommodate the greedy
approximation algorithm. In particular, for any specified $\delta \in (0, 1)$, the budget used in the binary search
criterion per iteration is scaled to $P(1 + \ln(1/\delta))$, and the corresponding decision rule is also changed as follows: if
Algorithm 2 outputs $p_{\mathcal{U}_G} > P(1 + \ln(1/\delta))$, the chosen value $c$ is infeasible for (24) and exceeds its optimum; otherwise,
the output $\mathcal{U}_G$ is a feasible solution to a relaxed version of (24) with budget $P(1 + \ln(1/\delta))$, and will
be kept as the best current solution to it. The adapted bisection method is tabulated in Algorithm 3,
which runs in polynomial time (for any fixed $\epsilon$) and has the optimality guarantee asserted in the following
proposition.

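Proposition 4: For any power budget $P$ and given $\delta, \epsilon \in (0, 1)$, Algorithm 3 finds a solution $\mathcal{U}$ such that

    $\min_k R_k(\mathcal{U}) \ge (1 - K\delta) \max_{\mathcal{U}':\, p_{\mathcal{U}'} \le P} \min_k R_k(\mathcal{U}') - \epsilon(1 - K\delta)$    (27)

and $p_{\mathcal{U}} \le P(1 + \ln(1/\delta))$.

Proof: The proof is given in Appendix D.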

Remark 1: In practice Algorithm [3] gives good results when invoked with S = but where the condition 

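$p_{\mathcal{U}_G} > P(1 + \ln(1/\delta))$ is replaced by $p_{\mathcal{U}_G} > P$. In addition, simple enhancements such as replacing the
search space $e \in \mathcal{E} \setminus \mathcal{U}_G$ in Algorithm 2 with $\{e \in \mathcal{E} \setminus \mathcal{U}_G : p_{\mathcal{U}_G \cup \{e\}} \le P\}$ (when the latter is invoked by
Algorithm 3) also improve performance.
Algorithm 3: (24) with $\mathcal{C}_d$. Input the channel matrices $\{\mathbf{H}_k\}$, $\delta$, tolerance $\epsilon$ and the ground set $\mathcal{E}$. Output
the set $\mathcal{U}$.
  Initialize $c_{\min} = 0$, $c_{\max} = \min_k R_k(\mathcal{E})$, and $\mathcal{U} = \emptyset$.
  while $c_{\max} - c_{\min} > \epsilon$ do
    Set $c \leftarrow (c_{\min} + c_{\max})/2$, and define $\bar{R}_c(\mathcal{U}) := (1/K) \sum_{k=1}^{K} \min\{R_k(\mathcal{U}), c\}$.
    Use Algorithm 2 with input $\delta$ to obtain the greedy solution $\mathcal{U}_G$.
    if $p_{\mathcal{U}_G} > P(1 + \ln(1/\delta))$ then
      Update $c_{\max} \leftarrow c$.
    else
      Update $c_{\min} \leftarrow c$ and $\mathcal{U} \leftarrow \mathcal{U}_G$.
    end if
  end while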
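The interplay of the greedy covering step and the adapted bisection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the greedy rule is taken to be the marginal gain per unit power, and the ground set (six rank-1 unit-norm codewords with unit power levels), the channels, and the names `greedy_cover` and `saturate` are all hypothetical.

```python
import numpy as np

def greedy_cover(avg_trunc, costs, c, delta):
    """Greedy covering sketch: add the element with the best gain-per-cost
    ratio until the truncated average reaches c(1 - delta)."""
    chosen, remaining = set(), set(range(len(costs)))
    while avg_trunc(chosen) < c * (1 - delta) and remaining:
        base = avg_trunc(chosen)
        best = max(remaining,
                   key=lambda e: (avg_trunc(chosen | {e}) - base) / costs[e])
        chosen.add(best)
        remaining.discard(best)
    return chosen

def saturate(rates, costs, P, delta=0.1, eps=1e-3):
    """Bisection on c with the relaxed budget P(1 + ln(1/delta)).
    `rates` is a list of set functions R_k(U) over element indices."""
    K, full = len(rates), set(range(len(costs)))
    budget = P * (1 + np.log(1 / delta))
    c_min, c_max, best = 0.0, min(R(full) for R in rates), set()
    while c_max - c_min > eps:
        c = 0.5 * (c_min + c_max)
        avg_trunc = lambda U, c=c: sum(min(R(U), c) for R in rates) / K
        U_g = greedy_cover(avg_trunc, costs, c, delta)
        if sum(costs[e] for e in U_g) > budget:
            c_max = c                 # c deemed infeasible
        else:
            c_min, best = c, U_g      # keep the best solution so far
    return best, c_min

# Hypothetical instance: K = 3 users, six rank-1 unit-norm codewords.
rng = np.random.default_rng(3)
M, N, K, n_elem = 4, 2, 3, 6
Hs = [rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)) for _ in range(K)]
ws = [rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1)) for _ in range(n_elem)]
ws = [w / np.linalg.norm(w) for w in ws]
costs = [1.0] * n_elem  # power levels p_e (assumed equal)

def make_rate(H):
    def R(U):
        Q = np.eye(N, dtype=complex)
        for e in U:
            HW = H @ ws[e]
            Q += costs[e] * (HW @ HW.conj().T)
        return np.linalg.slogdet(Q)[1]
    return R

rates = [make_rate(H) for H in Hs]
P, delta = 1.0, 0.1
U, c = saturate(rates, costs, P, delta)
```

In this sketch the returned set `U` respects the relaxed budget $P(1 + \ln(1/\delta))$, and every user's rate on `U` is at least $(1 - K\delta)$ times the last accepted value `c`, mirroring the bicriteria flavor of Proposition 4.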



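Finally, it is useful to derive an upper bound for (22) to benchmark the performance of Algorithm 3,
as given by

    $\max_{\beta,\ \{x_e\}} \; \beta$
    s.to  $\log \left| \mathbf{I} + \sum_{e \in \mathcal{E}} p_e x_e \mathbf{H}_k \mathbf{W}_e \mathbf{W}_e^\dagger \mathbf{H}_k^\dagger \right| \ge \beta, \quad \forall k,$
          $\sum_{e \in \mathcal{E}} x_e p_e \le P; \quad 0 \le x_e \le 1, \ \forall e \in \mathcal{E}.$    (28)

Notice that (22) and (28) are equivalent if we enforce the stricter constraints $x_e \in \{0, 1\}$, $\forall e \in \mathcal{E}$ in (28).
Then, an important observation that can be made using [22, p. 74] is that for each $1 \le k \le K$, the
function $\log \left| \mathbf{I} + \sum_{e \in \mathcal{E}} p_e x_e \mathbf{H}_k \mathbf{W}_e \mathbf{W}_e^\dagger \mathbf{H}_k^\dagger \right|$ is jointly concave in $[x_e]_{e \in \mathcal{E}}$. Consequently, it follows
that (28) is a convex optimization problem that can be efficiently solved.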



IV. Minimizing the Weighted Sum Delay 

A different precoder design criterion is considered in this section. Specifically, in contrast to focusing 
on the minimum instantaneous throughput among the users as in Section III, a weighted performance in
terms of decoding delay across the K users (P2) becomes the subject of interest. In order to make (P2)
more tractable, first consider a different expression for the decoding delay given by

$$D_k(\{\mathbf{W}^\tau\}) := 1 + \sum_{t=1}^{\infty} \left[ 1 - \mathbf{1}\Big( \sum_{\tau=1}^{Lt} R_k^\tau(\mathbf{W}^\tau)/\Theta \Big) \right] \qquad (29)$$

where the function $\mathbf{1}(\cdot)$, defined as $\mathbf{1}(x) = 1$ if $x \ge 1$ and $0$ if $0 \le x < 1$, indicates whether the threshold
$\Theta$ has been reached for the accumulated rate per user k. (Notice that the function value for negative
x can be disregarded, since the achievable rate is always non-negative.) Clearly, in (29) the delay
simply counts the number of scheduling intervals that are needed for the non-decreasing accumulated rate
to cross $\Theta$ and the common message to be reliably decoded. Although the delay $D_k$ can be expressed
as an analytical function of the precoder matrices as in (29), the indicator function $\mathbf{1}(\cdot)$ still makes the
problem (P2) difficult to solve. In the following, we will first consider optimizing (P2) over the continuous
codebook and then the optimization over the discrete codebook.
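
As a quick illustration of how the delay expression (29) counts scheduling intervals, the following minimal sketch evaluates it over a finite horizon. The function names and the list-of-rates input are assumptions for illustration; if the threshold is never reached within the supplied horizon, the returned value is only a lower bound on the true delay.

```python
def indicator(x):
    """1(x) of the text: 1 once the normalized accumulated rate reaches 1, else 0."""
    return 1.0 if x >= 1.0 else 0.0

def decoding_delay(slot_rates, L, theta):
    """Evaluate (29) for one user over len(slot_rates) // L scheduling intervals.

    slot_rates[tau] is the rate achieved in slot tau (hypothetical input),
    L is the number of slots per interval and theta is the rate threshold.
    """
    horizon = len(slot_rates) // L
    accumulated, delay = 0.0, 1
    for t in range(1, horizon + 1):
        accumulated += sum(slot_rates[(t - 1) * L : t * L])
        delay += 1 - int(indicator(accumulated / theta))
    return delay
```

For instance, a user that collects a rate of 3 per interval with a threshold of 10 crosses the threshold during the fourth interval, which is the value the function returns.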

A. Continuous Codebook 

We consider solving (P2) with the continuous codebook $\mathcal{C}_c$. Notice that the indicator function $\mathbf{1}(x)$
is discontinuous at the point $x = 1$, and this discontinuity in the cost as a function of the accumulated
rate will render it difficult to optimize the precoding codewords. Even upon employing an alternating
optimization approach as in Section III-A, the resultant sub-problems are non-convex and not easily
solvable. As the difficulty lies in the discontinuity, we propose to relax the indicator function as

$$\mathbf{1}_r(x) := \begin{cases} x, & 0 \le x < 1 \\ 1, & \text{otherwise.} \end{cases} \qquad (30)$$

Since the accumulated rate is never negative, the weighted sum delay minimization problem (P2), with
the delay $D_k(\{\mathbf{W}^\tau\})$ lower bounded by substituting the relaxed $\mathbf{1}_r(\cdot)$ of (30) into (29), is relaxed to

$$\text{(P2')} \quad \min_{\{\mathbf{W}^\tau \in \mathcal{C}_c\}} \ \sum_{k=1}^{K} \mu_k \left\{ 1 + \sum_{t=1}^{\infty} \left[ 1 - \mathbf{1}_r\Big( \sum_{\tau=1}^{Lt} R_k^\tau(\mathbf{W}^\tau)/\Theta \Big) \right] \right\}. \qquad (31)$$

We offer the following result.
Proposition 5: Suppose that for each user $k : 1 \le k \le K$ the input channel set $\{\mathbf{H}_k^\ell\}_{\ell=1}^{L}$ has at least
one non-zero channel matrix and further suppose that $\Theta$ is finite. Then, $\exists\, \hat{t} < \infty$ such that any optimal
solution $\{\mathbf{W}_{\rm opt}^\tau\}$ to (P2') can be truncated by setting $\mathbf{W}_{\rm opt}^\tau = \mathbf{0}$ for all $\tau > L\hat{t}$, without sacrificing
optimality.

Proof: The proof is given in Appendix E.
We emphasize that $\hat{t}$ in Proposition 5 can be determined as a function of only the given input channel set
and the threshold. Clearly the truncation can be done without loss of optimality by setting $\mathbf{W}_{\rm opt}^\tau = \mathbf{0}$
for all $\tau > Lt'$, for any $t' \ge \hat{t}$, as well. Further, note that Proposition 5 implies that without loss of
optimality (P2') can be regarded as a finite dimensional optimization problem in which the set of feasible
solutions is compact. Next, in order to solve the relaxed problem (P2'), we adopt the following approach.
We start by considering a particular choice $\hat{t}$ for the number of scheduling intervals and pose the following
problem

$$\text{(P2'')} \quad \max_{\{\mathbf{W}^\tau \in \mathbb{C}^{M \times d} : \|\mathbf{W}^\tau\|_F^2 \le P\}} \ \sum_{k=1}^{K} \sum_{t=1}^{\hat{t}} \mu_k \min\left\{ \frac{1}{\Theta} \sum_{\tau=1}^{Lt} R_k^\tau(\mathbf{W}^\tau), \ 1 \right\}. \qquad (32)$$
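
The link between the relaxed delay and the truncated maximization problem can be checked numerically. The sketch below assumes the relaxed indicator equals $\min\{x, 1\}$ for non-negative $x$ and uses hypothetical per-slot rate lists; over a horizon of $T$ intervals, the weighted relaxed delay equals $\sum_k \mu_k (1 + T)$ minus the objective of the truncated problem, so minimizing one amounts to maximizing the other.

```python
def relaxed_indicator(x):
    """Relaxed indicator for non-negative x: linear below the threshold, 1 above it."""
    return min(x, 1.0)

def interval_totals(slot_rates, L):
    """Accumulated rate at the end of each scheduling interval of L slots."""
    totals, accumulated = [], 0.0
    for t in range(len(slot_rates) // L):
        accumulated += sum(slot_rates[t * L : (t + 1) * L])
        totals.append(accumulated)
    return totals

def relaxed_delay(slot_rates, L, theta):
    """Relaxed delay of one user over the finite horizon covered by slot_rates."""
    return 1.0 + sum(1.0 - relaxed_indicator(a / theta)
                     for a in interval_totals(slot_rates, L))

def truncated_objective(rates_per_user, weights, L, theta):
    """Weighted sum over users and intervals of min{accumulated rate / theta, 1}."""
    return sum(mu * sum(relaxed_indicator(a / theta) for a in interval_totals(r, L))
               for mu, r in zip(weights, rates_per_user))
```

Because the relaxed indicator is never smaller than the exact one, `relaxed_delay` never exceeds the exact delay for the same rates.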



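Then, to solve (P2'') we leverage the same approach as in Section III-A and decompose it into optimally
solvable sub-problems. To this end, we introduce the linear filters $\{\mathbf{G}_k^\tau \in \mathbb{C}^{N_k \times d}\}_{\tau=1}^{L\hat{t}}$, the corresponding
MSE matrices $\{\mathbf{E}_k^\tau(\mathbf{G}_k^\tau, \mathbf{W}^\tau)\}_{\tau=1}^{L\hat{t}}$ as in (12) and the matrix slack variables $\{\mathbf{S}_k^\tau \in \mathbb{C}^{d \times d}\}_{\tau=1}^{L\hat{t}}$, per user
k. Invoking Lemmas 1 and 2, the problem (P2'') can be written as

$$\max_{\{\mathbf{W}^\tau \in \mathbb{C}^{M \times d} : \|\mathbf{W}^\tau\|_F^2 \le P\}, \{\mathbf{G}_k^\tau\}, \{\mathbf{S}_k^\tau \succ \mathbf{0}\}} \ \sum_{k=1}^{K} \sum_{t=1}^{\hat{t}} \mu_k \min\left\{ \frac{1}{\Theta} \sum_{\tau=1}^{Lt} \Big[ -\mathrm{Tr}\big[\mathbf{S}_k^\tau \mathbf{E}_k^\tau(\mathbf{G}_k^\tau, \mathbf{W}^\tau)\big] + \log|\mathbf{S}_k^\tau| + d \Big], \ 1 \right\}. \qquad (33)$$

Hence, the CAA algorithm is applicable to the relaxed problem (33). Fixing $\{\mathbf{W}^\tau\}$, the problem in (33)
can be optimally solved over $\{\mathbf{G}_k^\tau, \mathbf{S}_k^\tau\}$, using Lemmas 1 and 2. It now remains to update all the precoders
$\{\mathbf{W}^\tau\}$, while fixing the other variables, $\{\mathbf{G}_k^\tau\}$ and $\{\mathbf{S}_k^\tau\}$. Using $\{\alpha_k^t\}_{t=1}^{\hat{t}}$ to denote the minimum between
the accumulated normalized rate and the unit threshold for each user k, the aforementioned sub-problem
for $\{\mathbf{W}^\tau\}$ is equivalent to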



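$$\max_{\{\mathbf{W}^\tau \in \mathbb{C}^{M \times d} : \|\mathbf{W}^\tau\|_F^2 \le P\}, \{\alpha_k^t\}} \ \sum_{k=1}^{K} \sum_{t=1}^{\hat{t}} \mu_k \alpha_k^t \qquad (34a)$$

$$\text{s.to} \quad \Theta \alpha_k^t \le \sum_{\tau=1}^{Lt} \Big\{ -\mathrm{Tr}\big[\mathbf{S}_k^\tau \mathbf{E}_k^\tau(\mathbf{G}_k^\tau, \mathbf{W}^\tau)\big] + \log|\mathbf{S}_k^\tau| + d \Big\}, \ \forall k, t \qquad (34b)$$

$$\alpha_k^t \le 1, \ \forall k, t. \qquad (34c)$$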

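Furthermore, defining the Cholesky factorization per user k and slot $\tau$ as $\mathbf{S}_k^\tau = \mathbf{B}_k^\tau (\mathbf{B}_k^\tau)^\dagger$, the problem
(34) can be reformulated as

$$\max_{\{\mathbf{W}^\tau \in \mathbb{C}^{M \times d} : \|\mathbf{W}^\tau\|_F^2 \le P\}, \{\alpha_k^t, \beta_k^\tau\}} \ \sum_{k=1}^{K} \sum_{t=1}^{\hat{t}} \mu_k \alpha_k^t \qquad (35a)$$

$$\text{s.to} \quad \Theta \alpha_k^t \le \sum_{\tau=1}^{Lt} \Big\{ -\|\mathbf{G}_k^\tau \mathbf{B}_k^\tau\|_F^2 - \beta_k^\tau + \log|\mathbf{S}_k^\tau| + d \Big\}, \ \forall k, t \qquad (35b)$$

$$\beta_k^\tau \ge \big\| (\mathbf{B}_k^\tau)^\dagger \big( (\mathbf{G}_k^\tau)^\dagger \mathbf{H}_k^\tau \mathbf{W}^\tau - \mathbf{I}_d \big) \big\|_F^2, \ \forall k, \tau \qquad (35c)$$

$$\alpha_k^t \le 1, \ \forall k, t. \qquad (35d)$$

Thus, the constraint (34b) is represented by the linear constraint in (35b) together with a series of
quadratic cone constraints in (35c). Notice that other constraints on the power of each $\mathbf{W}^\tau$ are also
quadratic. Hence, the sub-problem (35) for obtaining $\{\mathbf{W}^\tau\}$ while fixing the remaining variables is also
an SOCP, and can be solved efficiently as mentioned earlier.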

These aforementioned sub-problems with their optimal solutions suggest an iterative CAA algorithm
to solve (P2''). However, upon convergence the accumulated rates of some users could be below $\Theta$, in
which case we can increment $\hat{t}$ and repeat the process. This procedure is tabulated in Algorithm 4. Notice
we have assumed that a set of precoders $\{\mathbf{W}^{(\ell)} \in \mathcal{C}_c\}_{\ell=1}^{L}$ yielding a rate vector $\boldsymbol{\Delta} \succ \mathbf{0}$ (componentwise
strictly greater than zero) over any scheduling interval is provided as an input. Such a set can be found by
using the CAA algorithm of Section III-A on the input channel matrices $\{\mathbf{H}_k^\ell\}$, $1 \le \ell \le L$, $1 \le k \le K$.
Indeed, an admission control module can be implemented in which the group of users to receive a
common message is decided by verifying whether the instantaneous rate optimizing algorithm of Section
III-A, when used over that group, can achieve a strictly positive (or a large enough) value for the minimum
instantaneous rate.

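Algorithm 4: To approximately solve (P2'). Input $\hat{t}$, the channel matrices $\{\mathbf{H}_k^\tau\}$, $1 \le \tau \le L$, $1 \le k \le K$,
a feasible set of precoders $\{\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ yielding rate vector $\boldsymbol{\Delta} \succ \mathbf{0}$. Output the final iterates.

while for at least one user k the accumulated rate is below $\Theta - \Delta_k$ do
    Increment $\hat{t} \leftarrow \hat{t} + 1$ and initialize $\{\mathbf{W}^\tau\}_{\tau=1}^{L\hat{t}}$.
    repeat
        for k = 1, ..., K do
            for $\tau = 1, \ldots, L\hat{t}$ do
                Given the precoder at slot $\tau$, update the LMMSE optimal receive filter as
                $\mathbf{G}_k^\tau \leftarrow \big[\mathbf{H}_k^\tau \mathbf{W}^\tau (\mathbf{W}^\tau)^\dagger (\mathbf{H}_k^\tau)^\dagger + \mathbf{I}_{N_k}\big]^{-1} \mathbf{H}_k^\tau \mathbf{W}^\tau$.
                Update the slack matrix $\mathbf{S}_k^\tau \leftarrow (\mathbf{W}^\tau)^\dagger (\mathbf{H}_k^\tau)^\dagger \mathbf{H}_k^\tau \mathbf{W}^\tau + \mathbf{I}_d$.
            end for
        end for
        Obtain the precoder matrices $\{\mathbf{W}^\tau\}$ by solving the SOCP problem (35).
    until Convergence
end while

Let $\{\mathbf{W}^\tau\}_{\tau=1}^{L\hat{t}}$ be the iterate upon convergence or an accumulation point. Output it if the accumulated rate
of each user is no less than $\Theta$, else augment $\{\mathbf{W}^\tau\}_{\tau=1}^{L\hat{t}}$ by $\{\mathbf{W}^{L\hat{t}+\ell} = \mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ and output it.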
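
The two closed-form updates inside Algorithm 4 can be sanity-checked in the scalar case. The sketch below is an illustration under stated assumptions: a real scalar channel `h`, a real scalar precoder `w`, unit noise variance, and an MSE of the form $(1 - g h w)^2 + g^2$ (the scalar counterpart of the MSE matrix, whose definition (12) lies outside this section). With the LMMSE filter and the matching slack value, the surrogate $-sE + \log s + d$ with $d = 1$ equals the rate $\log(1 + h^2 w^2)$, and any other filter or slack value gives a smaller surrogate.

```python
import math

def lmmse_filter(h, w):
    """Scalar version of the receive filter update in Algorithm 4."""
    return h * w / (h * h * w * w + 1.0)

def slack(h, w):
    """Scalar version of the slack update in Algorithm 4."""
    return w * w * h * h + 1.0

def mse(g, h, w):
    """Assumed scalar MSE with unit-variance noise: (1 - g*h*w)^2 + g^2."""
    return (1.0 - g * h * w) ** 2 + g * g

def surrogate(s, g, h, w, d=1):
    """Scalar counterpart of -Tr[S E] + log|S| + d."""
    return -s * mse(g, h, w) + math.log(s) + d

def rate(h, w):
    """Scalar counterpart of log|I + H W W^H H^H|."""
    return math.log(1.0 + h * h * w * w)
```

Evaluating `surrogate(slack(h, w), lmmse_filter(h, w), h, w)` and `rate(h, w)` for any real `h` and `w` shows the two values coincide, which is what makes the alternating updates monotone.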



Suppose that a simple initialization, that comprises repeating $\{\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ over $\hat{t}$ scheduling intervals,
is employed in each outer iteration of Algorithm 4. We can then prove that the algorithm terminates in
a finite number of steps even for this simple initialization by employing arguments similar to those in the
proof of Proposition 5, along with the fact that the alternating optimization procedure employed by the
algorithm to sub-optimally solve (P2'') monotonically improves the objective function value. A stronger
result is stated in the following which holds for any feasible initialization.
Proposition 6: The output upon termination of Algorithm 4 is a stationary point of (P2').
Proof: The proof is given in Appendix F.

Notice that each successive outer iteration of Algorithm 4 involves optimizing over a larger number
of variables since the number of scheduling intervals is incremented by one. One variation that can
substantially reduce complexity is to only optimize the transmit precoders $\{\mathbf{W}^\tau\}$ for $L(\hat{t}-1) + 1 \le \tau \le L\hat{t}$
(which correspond to the last scheduling interval) in each outer iteration, and fix the other precoders
to their respective values obtained in the previous iterations. It can be proved that this variation also
terminates in a finite number of iterations but its output need not be a stationary point of (P2').

B. Discrete Codebook 

In this section, we consider the discrete codebook version of (P2) given by

$$\text{(P2D)} \quad \min_{\{\mathbf{W}^\tau \in \mathcal{C}_d\}} \ \sum_{k=1}^{K} \mu_k D_k(\{\mathbf{W}^\tau\}). \qquad (36)$$

We assume that a set $\{\mathbf{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$ is available which achieves a rate-vector $\boldsymbol{\Delta} \succ \mathbf{0}$ over any
scheduling interval. We will show that (P2D) can be reformulated and sub-optimally solved by using
existing algorithms from [11], [18] but at a high complexity. We first propose a novel and non-trivial
modification to an algorithm from [18], which can significantly reduce the complexity and also offer a
performance guarantee. This modified algorithm is presented here as Algorithm 5. Notice that Algorithm
5 involves maintaining an ordered stack $\mathcal{S}$ to which a set of codewords is added in each iteration. Upon
termination, the set first added to $\mathcal{S}$ is used in the first scheduling interval, the set added second is
used in the second interval and so on. Further, notice that each iteration of Algorithm 5 also involves
(approximately) solving a maximization problem (37a) by invoking Algorithm 6.



The submodular property of the rate functions is utilized again to sub-optimally solve (37a) in Algorithm
6. We next explain how this algorithm was obtained and then state its performance guarantee. Since
in (37a) the precoders $\{\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ across all L slots are design variables, it is necessary to define a
concatenated ground set, $\mathcal{T}$, as

$$\mathcal{T} := \{(e, \ell) \mid e \in \mathcal{E} \ \& \ \ell \in \{1, \cdots, L\}\}. \qquad (38)$$
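Algorithm 5: To approximately solve (P2D). Input the channel matrices $\{\mathbf{H}_k^\tau\}$, $1 \le \tau \le L$, $1 \le k \le K$,
a feasible set of precoders $\{\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ yielding rate vector $\boldsymbol{\Delta} \succ \mathbf{0}$. Output the final iterates.

Set stack $\mathcal{S} = \emptyset$, $t = 1$, $\mathcal{I} = \{1, \cdots, K\}$ and $\theta_k = 0$, $\forall k$.
repeat
    Using Algorithm 6 sub-optimally solve:

    $$\max_{\{\mathbf{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}} \ \sum_{k \in \mathcal{I}} \mu_k \min\left\{ 1, \ \frac{\sum_{\ell=1}^{L} R_k^\ell(\mathbf{W}^{(\ell)})}{\Theta (1 - \theta_k)} \right\} \qquad (37a)$$

    and obtain $\{\hat{\mathbf{W}}^{(\ell)}\}_{\ell=1}^{L}$.
    Augment stack $\mathcal{S}$ by adding $\{\mathbf{W}^{(t-1)L+\ell} = \hat{\mathbf{W}}^{(\ell)}\}_{\ell=1}^{L}$ to it, update $\theta_k \leftarrow \theta_k + \sum_{\ell=1}^{L} R_k^\ell(\hat{\mathbf{W}}^{(\ell)})/\Theta$, $\forall k \in \mathcal{I}$,
    $t \leftarrow t + 1$ and $\mathcal{I} \leftarrow \mathcal{I} \setminus \{k \in \{1, \cdots, K\} : \theta_k \ge 1\}$.
until $\mathcal{I} = \emptyset$ or $\theta_k > 1 - \Delta_k$, $\forall k$

Output $\mathcal{S}$ if the accumulated rate of each user is no less than $\Theta$, else augment $\mathcal{S}$ by $\{\mathbf{W}^{(t-1)L+\ell} =
\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$ and output it.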
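
The outer loop of Algorithm 5 can be sketched as follows. This is only an illustration under simplifying assumptions: the per-interval maximization is done by brute force over a short list of candidate per-user rate vectors (a hypothetical stand-in for Algorithm 6), the per-interval score is assumed to be the weighted sum of $\min\{1, r_k / (\Theta(1-\theta_k))\}$ over the users still waiting, and the final augmentation step of the algorithm is omitted.

```python
def greedy_schedule(candidates, weights, theta, max_intervals=1000):
    """Greedy interval-by-interval selection in the spirit of Algorithm 5.

    candidates[i][k] is the rate user k would get in one interval if the
    i-th candidate set of codewords were used (hypothetical input).
    Returns the ordered stack of chosen candidate indices and the
    normalized accumulated rates of all users.
    """
    K = len(weights)
    acc = [0.0] * K                  # normalized accumulated rate per user
    active = set(range(K))           # users that have not yet decoded
    stack = []

    def score(r):
        # Weighted, capped progress towards the residual threshold of each active user.
        return sum(weights[k] * min(1.0, r[k] / (theta * (1.0 - acc[k])))
                   for k in active)

    while active and len(stack) < max_intervals:
        best = max(range(len(candidates)), key=lambda i: score(candidates[i]))
        if score(candidates[best]) <= 0.0:
            break                    # no candidate serves any remaining user
        stack.append(best)
        for k in active:
            acc[k] += candidates[best][k] / theta
        active = {k for k in active if acc[k] < 1.0}
    return stack, acc
```

The order of the returned stack is the order in which the chosen codeword sets would be used over successive scheduling intervals.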




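Then, for any given subset $\mathcal{I} \subseteq \{1, \cdots, K\}$ and any scalars $\theta_k \in [0, 1)$ $\forall k \in \mathcal{I}$, we define the set
function $f : 2^{\mathcal{T}} \to \mathbb{R}_+$, as

$$f(\mathcal{V}) := \sum_{k \in \mathcal{I}} \mu_k \min\left\{ 1, \ \frac{1}{\Theta(1 - \theta_k)} \sum_{\ell=1}^{L} \log\Big| \mathbf{I} + \sum_{(e, \ell') \in \mathcal{V} : \ell' = \ell} p_e \mathbf{H}_k^\ell \mathbf{w}_e \mathbf{w}_e^\dagger (\mathbf{H}_k^\ell)^\dagger \Big| \right\} \qquad (39)$$

for any $\mathcal{V} \subseteq \mathcal{T}$. The problem in (37a) can now be cast as

$$\max_{\mathcal{V} \subseteq \mathcal{T}} \ f(\mathcal{V}) \quad \text{s.to} \quad \sum_{(e, \ell') \in \mathcal{V} : \ell' = \ell} p_e \le P, \quad |\{(e, \ell') \in \mathcal{V} : \ell' = \ell\}| \le d, \quad \forall \ell = 1, \ldots, L. \qquad (40)$$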



The following proposition states an important property possessed by the set function $f(\cdot)$.
Proposition 7: The function $f(\cdot)$ in (39) is a monotonic submodular set function over the ground set $\mathcal{T}$.

Proof: The proof is given in Appendix G.
In order to take advantage of recently developed submodular function maximization algorithms, the
rank and power constraints in (40) need to be cast into the form of linear packing (knapsack) constraints.
This can be accomplished readily by associating each element in $\mathcal{T}$ with a unique index in $1, \cdots, |\mathcal{T}|$,
where we note $|\mathcal{T}| = L|\mathcal{E}|$, and for each subset $\mathcal{V} \subseteq \mathcal{T}$ letting $\mathbf{x}_{\mathcal{V}}$ denote a binary ($\{0, 1\}$) valued vector
of length $L|\mathcal{E}|$ that has ones in positions indexed by the indices corresponding to elements in $\mathcal{V}$ and zeros
elsewhere. Then, the 2L constraints in (40) can be represented as $\mathbf{A}\mathbf{x}_{\mathcal{V}} \le \mathbf{b}$, where $\mathbf{A}$ is a $2L \times L|\mathcal{E}|$
matrix whose rows correspond to the constraints and whose columns correspond to the elements of $\mathcal{T}$.
Thus, (40) can be re-cast as

$$\max_{\mathcal{V} \subseteq \mathcal{T}} \ f(\mathcal{V}) \quad \text{s.to} \quad \mathbf{A}\mathbf{x}_{\mathcal{V}} \le \mathbf{b}. \qquad (41)$$
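
To make the packing form concrete, the sketch below builds one possible $\mathbf{A}$ and $\mathbf{b}$ for per-slot power and rank constraints. The row ordering (L power rows followed by L rank rows), the column indexing `l * |E| + e`, and all function names are assumptions chosen for illustration, not the paper's own convention.

```python
def build_packing(powers, L, d, P):
    """Return (A, b) with L power rows followed by L rank rows.

    powers[e] is the power of codeword e; column l * len(powers) + e
    corresponds to using codeword e in slot l.
    """
    n = len(powers)
    A = [[0.0] * (L * n) for _ in range(2 * L)]
    for l in range(L):
        for e in range(n):
            j = l * n + e
            A[l][j] = powers[e]      # contribution to the power budget of slot l
            A[L + l][j] = 1.0        # contribution to the rank budget of slot l
    b = [float(P)] * L + [float(d)] * L
    return A, b

def indicator_vector(V, n, L):
    """Binary vector x_V for a set V of (codeword, slot) pairs."""
    x = [0] * (L * n)
    for e, l in V:
        x[l * n + e] = 1
    return x

def is_feasible(A, b, x):
    """Check A x <= b row by row."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bm + 1e-12
               for row, bm in zip(A, b))

def width(A, b):
    """Smallest ratio b_m / A_{m,j} over the positive entries of A."""
    return min(b[m] / A[m][j]
               for m in range(len(b)) for j in range(len(A[m])) if A[m][j] > 0)
```

With this layout every column has exactly two non-zero entries, one in a power row and one in a rank row.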



There are some parameters from $\mathbf{A}$ and $\mathbf{b}$ that are worth pointing out and are important for characterizing
the approximation factors of the algorithms proposed below. First, $\vartheta := \min_{m,j}\{b_m/A_{m,j} : A_{m,j} > 0\}$
is defined as the width of all the packing constraints and note that $\vartheta \ge 1$. Secondly, there are only $\kappa = 2$
non-zero entries per column of $\mathbf{A}$. Thus, the constraints in (41) are column-sparse ones and hence (41)
can be solved using an algorithm for submodular maximization under column sparse knapsack constraints,
proposed in [20]. This algorithm (whose complexity scales polynomially in $|\mathcal{E}|L$) involves randomized
rounding combined with alteration and guarantees a constant approximation factor which does not
depend on L. However, since that randomized algorithm is computationally demanding to implement,
here we instead employ an algorithm from [19], designed for approximately solving submodular maximization
under arbitrary knapsack constraints, and tabulate it in Algorithm 6. Note that in Algorithm 6
we assume that $\mathcal{M}(v)$ returns the index corresponding to any element $v \in \mathcal{T}$. Further, an expansion
step (which is important to establish a performance guarantee for Algorithm 5) is added as the last step
of Algorithm 6. To explain this expansion, we first define $\tilde{\mathcal{C}}_d$ to be a subset of $\mathcal{C}_d$ comprising all
maximal precoding matrix codewords in $\mathcal{C}_d$, i.e., no precoding matrix in $\tilde{\mathcal{C}}_d$ can be expanded by adding
any element from $\mathcal{E}$ without violating the rank or power constraints. Then, in the last step in Algorithm
6 we ensure that $\hat{\mathcal{V}}$ is expanded so that each one of its corresponding set of L codewords $\{\mathbf{W}^{(\ell)}\}_{\ell=1}^{L}$
lies in $\tilde{\mathcal{C}}_d$. Notice that since each rate function is a monotonic set function over $\mathcal{E}$, any arbitrary expansion will
not decrease the value of the objective function.

The following result, which holds even when no expansion is employed in the last step of Algorithm
6, follows upon invoking Theorem 1 from [19].

Lemma 4: Algorithm 6 is a deterministic polynomial-time algorithm that attains an approximation ratio
of $\Omega(1/(2L)^{1/\vartheta})$. In other words, its final output $\hat{\mathcal{V}}$ is feasible, i.e., $\mathbf{A}\mathbf{x}_{\hat{\mathcal{V}}} \le \mathbf{b}$, and also achieves a
constant approximation guarantee

$$f(\hat{\mathcal{V}}) \ge \Omega\left( \frac{1}{(2L)^{1/\vartheta}} \right) \max_{\mathcal{V} \subseteq \mathcal{T} : \mathbf{A}\mathbf{x}_{\mathcal{V}} \le \mathbf{b}} f(\mathcal{V}). \qquad (42)$$

Before we establish a performance guarantee for Algorithm 5 we offer the following result which will
be invoked later.

Proposition 8: The problem in (P2D) can be further constrained without loss of optimality by enforcing
that each precoder used must lie in the set $\tilde{\mathcal{C}}_d$ and no more than $\lceil \Theta / \min_k \Delta_k \rceil$ scheduling intervals can
employ an identical set of maximal precoding matrix codewords.



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 






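Algorithm 6: To approximately solve (41). Input the channel matrices $\{\mathbf{H}_k^\tau\}$, $\mathbf{A}$ and $\mathbf{b}$ as in (41) and
an update factor $\lambda \in \mathbb{R}_+$. Output a subset $\hat{\mathcal{V}}$.

Initialize $\mathcal{V}' = \emptyset$;
for m = 1, ..., 2L do
    Set the variable $\omega_m \leftarrow 1/b_m$.
end for
while $\sum_{m=1}^{2L} b_m \omega_m \le \lambda$ and $\mathcal{V}' \ne \mathcal{T}$ do
    Find $v = \arg\min_{v \in \mathcal{T} \setminus \mathcal{V}'} \big[ \sum_{m=1}^{2L} A_{m,\mathcal{M}(v)} \omega_m / (f(\mathcal{V}' \cup v) - f(\mathcal{V}')) \big]$ and let $i = \mathcal{M}(v)$ denote
    its corresponding index.
    Update $\mathcal{V}' \leftarrow \mathcal{V}' \cup v$.
    for m = 1, ..., 2L do
        Update $\omega_m \leftarrow \omega_m \lambda^{A_{m,i}/b_m}$.
    end for
end while
if $\mathbf{A}\mathbf{x}_{\mathcal{V}'} \le \mathbf{b}$ then
    Set $\hat{\mathcal{V}} = \mathcal{V}'$.
else if $f(\mathcal{V}' \setminus \mathcal{M}^{-1}(i)) \ge f(\mathcal{M}^{-1}(i))$ then
    Set $\hat{\mathcal{V}} = \mathcal{V}' \setminus \mathcal{M}^{-1}(i)$.
else
    Set $\hat{\mathcal{V}} = \mathcal{M}^{-1}(i)$.
end if
Expand $\hat{\mathcal{V}}$ if needed and output it.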
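
The multiplicative-update greedy of Algorithm 6 can be sketched in a few lines. The sketch below is an illustration only: the set function `f`, the `column` map from elements to columns of `A`, and the early exit when no remaining element has a positive marginal gain are assumptions added here, and the final expansion step is omitted.

```python
import math

def is_feasible(A, b, chosen, column):
    """Check the packing constraints A x_V <= b for the chosen set."""
    return all(sum(A[m][column[v]] for v in chosen) <= b[m] + 1e-12
               for m in range(len(b)))

def packing_greedy(f, ground, A, b, lam, column):
    """Greedy selection with multiplicatively updated constraint weights.

    f maps a frozenset of elements to a non-negative value (assumed monotone
    submodular), lam > 1 is the update factor, and column[v] is the column
    of A associated with element v.
    """
    rows = range(len(b))
    omega = [1.0 / b[m] for m in rows]
    picked, last = [], None
    while sum(b[m] * omega[m] for m in rows) <= lam and len(picked) < len(ground):
        current = frozenset(picked)
        base = f(current)
        best, best_score = None, math.inf
        for v in ground:
            if v in current:
                continue
            gain = f(current | {v}) - base
            if gain <= 0.0:
                continue  # a zero-gain element has an infinite cost-to-gain ratio
            score = sum(A[m][column[v]] * omega[m] for m in rows) / gain
            if score < best_score:
                best, best_score = v, score
        if best is None:
            break
        picked.append(best)
        last = best
        for m in rows:
            omega[m] *= lam ** (A[m][column[best]] / b[m])
    chosen = frozenset(picked)
    if is_feasible(A, b, chosen, column):
        return chosen
    without_last = chosen - {last}   # feasible, since the loop condition held before the last pick
    if f(without_last) >= f(frozenset({last})):
        return without_last
    return frozenset({last})
```

Because $b_m \omega_m = \lambda^{(\mathbf{A}\mathbf{x})_m / b_m}$ at every step, the loop condition guarantees that the set held before the last addition satisfies all constraints, which is why removing the last element restores feasibility.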



We are now ready to establish the performance guarantee for Algorithm 5. The proof of Proposition
8 as well as the one below are given in Appendix H.

Proposition 9: The solution returned by Algorithm 5 guarantees a weighted sum delay that is no greater
than $\Gamma \ln(1/\epsilon)$ times that of the optimal solution to (P2D), where $\Gamma$ is a fixed constant and the scalar $\epsilon$
is dependent on the input set of channel matrices, as

$$\epsilon = \min_{k \in \{1, \cdots, K\}} \ \min_{\{\mathbf{W}^{(\ell)} \in \tilde{\mathcal{C}}_d\}_{\ell=1}^{L} : \sum_{\ell=1}^{L} R_k^\ell(\mathbf{W}^{(\ell)}) > 0} \left\{ \sum_{\ell=1}^{L} R_k^\ell(\mathbf{W}^{(\ell)}) \right\}. \qquad (43)$$

Note that $\epsilon$ represents the smallest positive rate that can be achieved by using maximal codewords
over a scheduling interval.
V. Numerical Examples

In this section, the effectiveness of the proposed algorithms is shown through numerical tests, where 
independently and identically distributed Rayleigh fading between the BS and each user is assumed. 
Test Case 1: The MISO channel from the BS to each single antenna user is considered with the number
of transmitting antennas being M = 2 and M = 4, respectively. Fig. 1 plots the maximum achievable
rate of different schemes with respect to (wrt) the number of users K, which increases from 1 to 64,
both in the logarithmic scales. The power budget is set to P = 10, such that the equivalent transmit
signal-to-noise ratio (SNR) is 10 dB. The proposed CAA algorithm with number of streams d = 2 is
compared with three other schemes. The optimal scheme with number of streams d = M is obtained
by solving a semi-definite program (SDP) using SeDuMi [33], whereas the open-loop precoder refers to
the case where W is a scaled identity matrix. Moreover, the recursive design proposed in 111 by setting 
d = 2 is also compared, and used to initialize the CAA algorithm besides a random initialization. Note 
that both the recursive design and CAA are constrained by d = 2 so that neither can achieve the optimal 
scaling when M = 4. Nevertheless, even in this case the CAA algorithm with both initializations keeps 
on exhibiting near-optimal performance, especially considering the fact that the optimal scheme with 
d = M = 4 provides the non-achievable upper bound. This clearly shows the near-optimal performance 
of the proposed CAA algorithm and its insensitivity to initializations. 

Test Case 2: The system settings are the same as those in Test Case 1, except for the number of receive
antennas $N_k = 2$, $\forall k$, so that each user has a MIMO channel. For this case, the optimal precoder design
is no longer an SDP problem, but the open-loop scheme still has the same scaling wrt K as the optimal
one. As seen in Fig. 2, the CAA algorithm fails to achieve the optimal scaling when M = 4. However,
in spite of being constrained by d = 2, the CAA algorithm still outperforms the open-loop one with d = 4
when the number of users is less than 32, and its advantage over an intuitive extension of the recursive design
(referred to as Rec-type design), wherein the channel matrix to the worst user is used as the transmit
precoder after appropriate scaling, becomes more evident. For clarity a sub-figure in the linear scale has
also been plotted.

Test Case 3: The minimum weighted sum delay problem (P2) is now considered over a system in which
the BS has M = 4 antennas and there are K = 8 users and the transmit precoders are constrained to
have rank no greater than two (i.e., d = 2). In addition, the rate threshold $\Theta$ is set to be 10, while for
simplicity the number of orthogonal slots per interval is set as L = 1. The CAA-based Algorithm 4, which
jointly optimizes all the precoding matrices, is compared with two other schemes, where in each interval
(or equivalently here in each slot) the same precoder is employed and this precoder in turn is either
obtained by the recursive design or by Algorithm 1, respectively, as detailed in Test Cases 1 and 2. The
greedy Algorithm 4 corresponds to the reduced complexity scheme which involves solving for $\mathbf{W}^{\hat{t}}$ after
fixing all precoders prior to slot $\hat{t}$. Further, two approaches for initializing Algorithm 4 are considered:
(I1) upon incrementing to slot $\hat{t}$, augmenting with $\mathbf{W}^{\hat{t}} = \bar{\mathbf{W}}$ while fixing all precoders obtained prior
to this slot; and (I2) at each slot increment simply setting $\mathbf{W}^\tau = \bar{\mathbf{W}}$, $\forall\, 1 \le \tau \le \hat{t}$, where $\bar{\mathbf{W}}$ is the
solution obtained using Algorithm 1. Both the per-user MISO channel ($N_k = 1$, $\forall k$) and the MIMO
channel ($N_k = 2$, $\forall k$) are considered with uniform weights $\mu_k = 1/K$, $\forall k$. In addition, unequal user
weights are also considered for the $N_k = 2$ case by setting $\mu_{\hat{k}} = 0.9$ for the user $\hat{k} := \arg\min_k R_k$ as
determined by the solution of Algorithm 1 and $\mu_k = 0.1/(K - 1)$ for any other user $k \ne \hat{k}$. The exact
weighted sum delay which is the objective in (P2) and the relaxed one associated with (P2') are plotted
versus the power budget (per slot) P, in Fig. 3. Clearly, the joint optimization schemes of Algorithm
4 yield improvement over the other ones, particularly so when unequal weights are assigned, as expected.
Meanwhile, the curves of greedy Algorithm 4 are quite close to the ones of the original Algorithm 4,
which strongly supports the use of the reduced complexity scheme in practice. Interestingly, the relaxed
delay curves exhibit the same relative behavior as their exact delay counterparts, which justifies using
the relaxed problem (P2') to design transmit precoders that reduce the weighted sum delay.

Test Case 4: We now examine optimization using the discrete codebook $\mathcal{C}_d$. We consider the rate
optimization in (22) over a system with five users, with $N_k = 2$, $\forall k$ receive antennas and where the base
station has M = 4 transmit antennas. The rank one LTE codebook comprising 16 unit-norm vectors
[36] formed the base codebook $\mathcal{W}$ and for each codeword an identical set of four power levels is allowed,
which together specify the ground set $\mathcal{E}$. In Fig. 4 we plot the achieved throughputs for different
values of transmit SNR. In particular, we have plotted the throughput upper bound obtained by
solving (28), as well as that yielded by Algorithm 3 when the latter is invoked with $\delta = 0$, $\epsilon = 0.08$ along
with its practical refinements discussed in Section III-B. For comparison, we also plot the throughput
yielded by a simple greedy algorithm, which at each step selects the element from $\mathcal{E}$ yielding the largest
increase in the instantaneous rate subject to the transmit power constraint. Note that at moderate values
of SNR Algorithm 3 yields a good improvement over the simple greedy baseline. The gains are lower at
high SNRs since in that regime the transmit power constraint becomes increasingly irrelevant (i.e., most
of the codebook can be selected). We emphasize that the upper bound which relaxes the binary-value
constraints need not be achievable, particularly at low SNRs.
VI. Conclusions and Future Work

We considered the design of linear precoders for multicast by using instantaneous rate and weighted 
sum delay as the design criteria. The linear precoders were allowed to be any complex valued matrices 
subject to rank and power constraints (a.k.a. the continuous codebook case). Alternatively, the linear 
precoders could be constructed by selecting and concatenating codewords from a given finite codebook 
of precoding matrices (a.k.a. the discrete codebook case). For the former case, cyclic alternating ascent 
(CAA) based algorithms were proposed, whereas for the latter case greedy algorithms that exploit 
submodularity of the rate function were proposed. The proposed algorithms were shown to possess 
certain desirable properties such as satisfying KKT conditions and offering worst-case guarantees. 

The CAA based algorithms offer good performance but their complexities can be deemed high for 
some implementations, since they involve solving an SOCP in each step. An interesting avenue for future 
work would be to determine whether explicit solutions can be obtained for special instances and then 
leverage them. On the other hand, the greedy algorithms for the discrete codebook case are simple to 
implement. However, the performance guarantee obtained for the weighted sum delay minimization might 
be weak and the design of approximation algorithms with better guarantees is an open problem. 

Furthermore, recall that the quasi-static assumption adopted for the weighted sum delay minimization 
problem allowed us to use any arbitrary number of scheduling intervals to ensure that the threshold for 
each user is achieved. In problems where a strict limit on the number of intervals is present, we would 
require an admission control module to select a multi-cast group of users and/or to set an appropriate 
threshold to ensure that decoding at all users can be achieved. Extending our proposed techniques to 
design such a module is an interesting open problem. Finally, developing robust versions of the results 
developed in this paper, by adopting a bounded CSI error model (as in [Si, 111) is also an interesting 
problem. While such an extension is not difficult for the continuous codebook case, its discrete counterpart 
seems challenging since the submodularity property may no longer hold for the worst-case (over all error 
realizations) per-user rate. 

Appendix A 
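Proof of Proposition 1

To proceed, define the objective for the inner minimization in (16) as

$$r_k(\mathbf{W}, \mathbf{G}_k, \mathbf{S}_k) = -\mathrm{Tr}[\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})] + \log|\mathbf{S}_k| + d, \quad 1 \le k \le K \qquad (44)$$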






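and the one for the outer maximization as

$$g(\mathbf{W}, \{\mathbf{G}_k, \mathbf{S}_k\}) = \min_{k=1,\ldots,K} r_k(\mathbf{W}, \mathbf{G}_k, \mathbf{S}_k). \qquad (45)$$

Moreover, let $i \in \mathbb{Z}_+$ be the iteration index for the while-loop in Algorithm 1, which is initialized with
the input precoder $\mathbf{W}(0)$. Further, denote the maximal objective values achieved before and after the
precoder update at the i-th iteration as

$$g_i = g(\mathbf{W}(i-1), \{\mathbf{G}_k(i), \mathbf{S}_k(i)\}), \qquad \zeta_i = g(\mathbf{W}(i), \{\mathbf{G}_k(i), \mathbf{S}_k(i)\}), \quad \forall i \in \mathbb{Z}_+. \qquad (46)$$

The ascent nature of the iterations in Algorithm 1 ensures the sequence $\{g_i\}$ is monotonically non-decreasing
and hence convergent, and also $g_i \le \zeta_i \le g_{i+1}$, $\forall i \in \mathbb{Z}_+$. Due to the boundedness of
$\{\|\mathbf{W}(i)\|\}$ ensured by the norm constraint in (16), there exists a subsequence $\mathcal{I}$ such that $\mathbf{W}(i) \to \bar{\mathbf{W}}$,
$i \in \mathcal{I}$. Line 3 of Algorithm 1 indicates that $\mathbf{G}_k(i+1)$ is obtained from an analytical function of $\mathbf{W}(i)$,
thus it follows that for any k, $\mathbf{G}_k(i+1) \to \bar{\mathbf{G}}_k$, $i \in \mathcal{I}$. A similar argument holds for each $\mathbf{S}_k(i+1) \to \bar{\mathbf{S}}_k$,
$i \in \mathcal{I}$. Consequently, the convergence for the objective value sequence follows, as

$$g_{i+1} \to \bar{g} := g(\bar{\mathbf{W}}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\}), \quad i \in \mathcal{I}. \qquad (47)$$

Note that since the sequence $\{g_{i+1}\}_{i \in \mathcal{I}}$ converges and it is a subsequence of the convergent sequence
$\{g_i\}_{i \in \mathbb{Z}_+}$, we must have that $g_i \to \bar{g}$, $i \in \mathbb{Z}_+$. Further, the monotonicity of $\{g_i\}_{i \in \mathbb{Z}_+}$ and the relation
$g_i \le \zeta_i \le g_{i+1}$ ensure that $\zeta_i \to \bar{g}$, $i \in \mathcal{I}$ as well as $\zeta_{i+1} \to \bar{g}$, $i \in \mathcal{I}$.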

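Next, we want to show that $(\bar{\mathbf{W}}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$ constitutes a fixed point for the CAA iterations. Since the
updates of $\mathbf{G}_k$ and $\mathbf{S}_k$ are both closed-form for any k, it is easy to see that

$$\bar{\mathbf{G}}_k = \big(\mathbf{H}_k \bar{\mathbf{W}} \bar{\mathbf{W}}^\dagger \mathbf{H}_k^\dagger + \mathbf{I}_{N_k}\big)^{-1} \mathbf{H}_k \bar{\mathbf{W}} \qquad (48)$$

$$\bar{\mathbf{S}}_k = \bar{\mathbf{W}}^\dagger \mathbf{H}_k^\dagger \mathbf{H}_k \bar{\mathbf{W}} + \mathbf{I}_d. \qquad (49)$$

Thus, it remains to prove that $\bar{\mathbf{W}} \in \mathcal{W}(\{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$, where the latter represents the optimal solution set of
the SOCP problem (18) given the inputs $\mathbf{G}_k = \bar{\mathbf{G}}_k$ and $\mathbf{S}_k = \bar{\mathbf{S}}_k$. Recall that the subsequence $\{\zeta_{i+1}\}_{i \in \mathcal{I}}$
converges to the limit point $\bar{g}$. Therefore, if it can be shown that

$$\zeta_{i+1} \to g(\mathbf{U}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\}), \quad i \in \mathcal{I}, \qquad (50)$$

for some $\mathbf{U} \in \mathcal{W}(\{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$, then we can deduce that $g(\mathbf{U}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\}) = \bar{g}$, from which it follows that
$\bar{\mathbf{W}} \in \mathcal{W}(\{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$. To show (50), consider the following sequence of functions in $\mathbf{W}$

$$h_{i+1}(\mathbf{W}) := g(\mathbf{W}, \{\mathbf{G}_k(i+1), \mathbf{S}_k(i+1)\}), \quad \forall i \in \mathcal{I} \qquad (51)$$

where $\zeta_{i+1} = h_{i+1}(\mathbf{W}(i+1)) = \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h_{i+1}(\mathbf{W})$. Since each function $r_k$ is quadratic in $\mathbf{W}$
(with $\mathbf{G}_k$ and $\mathbf{S}_k$ given), and $h_{i+1}$ is the minimum of a finite number of such $r_k$'s, it can be shown that
the sequence of functions converges point-wise to the following function

$$h(\mathbf{W}) := g(\mathbf{W}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\}). \qquad (52)$$

Further, since only the compact set defined by the norm ball $\|\mathbf{W}\|_F^2 \le P$ is of interest, point-wise
convergence in $\{h_{i+1}(\cdot)\}_{i \in \mathcal{I}}$ leads to uniform convergence; that is, for any $\epsilon > 0$, there exists an
iteration index $i' \in \mathcal{I}$ such that $|h_{i+1}(\mathbf{W}) - h(\mathbf{W})| < \epsilon$, $\forall i \in \mathcal{I}$, $i \ge i'$, and $\|\mathbf{W}\|_F^2 \le P$. From this
uniform convergence, it holds that

$$\zeta_{i+1} = \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h_{i+1}(\mathbf{W}) \le \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} [h(\mathbf{W}) + \epsilon] = \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h(\mathbf{W}) + \epsilon, \quad \forall i \in \mathcal{I}, i \ge i', \qquad (53)$$

and similarly

$$\zeta_{i+1} = \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h_{i+1}(\mathbf{W}) \ge \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} [h(\mathbf{W}) - \epsilon] = \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h(\mathbf{W}) - \epsilon, \quad \forall i \in \mathcal{I}, i \ge i', \qquad (54)$$

and this leads to the following convergence

$$\zeta_{i+1} \to \max_{\mathbf{W} : \|\mathbf{W}\|_F^2 \le P} h(\mathbf{W}) = g(\mathbf{U}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\}), \quad i \in \mathcal{I}, \qquad (55)$$

for some $\mathbf{U} \in \mathcal{W}(\{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$, which is sufficient for claiming (50) and completing the proof that
$(\bar{\mathbf{W}}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$ is a fixed point for the CAA iterations. With $(\bar{\mathbf{W}}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$ in hand the remaining
part of the proposition follows by first noting that

$$\big[-\mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})) + \log|\mathbf{S}_k| + d\big]\Big|_{(\bar{\mathbf{W}}, \bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k)} = \log|\mathbf{I} + \mathbf{H}_k \mathbf{W} \mathbf{W}^\dagger \mathbf{H}_k^\dagger|\Big|_{\bar{\mathbf{W}}}. \qquad (56)$$

Then, specializing the gradient formulas in [16] to our case we get that

$$\nabla_{\mathbf{W}} \big[-\mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})) + \log|\mathbf{S}_k| + d\big] = \mathbf{H}_k^\dagger \mathbf{H}_k \mathbf{W} \mathbf{E}_k(\mathbf{G}_k, \mathbf{W}) \mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W}) \qquad (57)$$

and

$$\nabla_{\mathbf{W}} \log|\mathbf{I} + \mathbf{H}_k \mathbf{W} \mathbf{W}^\dagger \mathbf{H}_k^\dagger| = \mathbf{H}_k^\dagger \mathbf{H}_k \mathbf{W} (\mathbf{I} + \mathbf{W}^\dagger \mathbf{H}_k^\dagger \mathbf{H}_k \mathbf{W})^{-1} \qquad (58)$$

so that

$$\nabla_{\mathbf{W}} \big[-\mathrm{Tr}(\mathbf{S}_k \mathbf{E}_k(\mathbf{G}_k, \mathbf{W})) + \log|\mathbf{S}_k| + d\big]\Big|_{(\bar{\mathbf{W}}, \bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k)} = \nabla_{\mathbf{W}} \log|\mathbf{I} + \mathbf{H}_k \mathbf{W} \mathbf{W}^\dagger \mathbf{H}_k^\dagger|\Big|_{\bar{\mathbf{W}}}. \qquad (59)$$

Using (56) and (59) we can conclude that $(\bar{\mathbf{W}}, \{\bar{\mathbf{G}}_k, \bar{\mathbf{S}}_k\})$ satisfies the KKT conditions of (P1) as well.

□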
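Appendix B
Proof of Proposition 2

Consider any subsets $\mathcal{U} \subseteq \mathcal{U}' \subseteq \mathcal{E}$ such that $\mathcal{U}' = \mathcal{U} \cup \mathcal{V}$. Note that it suffices to consider $e' \in \mathcal{E} \setminus \mathcal{U}'$ since
the proposition is trivially true for $e' \in \mathcal{U}'$. Define a function $f_k(\mathcal{A}) = \sum_{e \in \mathcal{A}} p_e \mathbf{H}_k \mathbf{w}_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger$, $\mathcal{A} \subseteq \mathcal{E}$.
Then, for any element $e' \in \mathcal{E} \setminus \mathcal{U}'$ we have

$$R_k(\mathcal{U}' \cup \{e'\}) - R_k(\mathcal{U}') = \log|\mathbf{I} + f_k(\{e'\}) + f_k(\mathcal{U}')| - \log|\mathbf{I} + f_k(\mathcal{U}')|$$
$$= R_k(\mathcal{U} \cup \{e'\}) - R_k(\mathcal{U}) + \log\Big|\mathbf{I} + \sum_{e \in \mathcal{V}} p_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger \big(\mathbf{I} + f_k(\{e'\}) + f_k(\mathcal{U})\big)^{-1} \mathbf{H}_k \mathbf{w}_e\Big| - \log\Big|\mathbf{I} + \sum_{e \in \mathcal{V}} p_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger \big(\mathbf{I} + f_k(\mathcal{U})\big)^{-1} \mathbf{H}_k \mathbf{w}_e\Big|. \qquad (60)$$

Note that $(\mathbf{I} + f_k(\{e'\}) + f_k(\mathcal{U}))^{-1} \preceq (\mathbf{I} + f_k(\mathcal{U}))^{-1}$, where $\preceq$ denotes the positive semi-definite ordering,
since $f_k(\{e'\})$, $f_k(\mathcal{U})$ are both positive semi-definite matrices, from which we can deduce that

$$\log\Big|\mathbf{I} + \sum_{e \in \mathcal{V}} p_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger \big(\mathbf{I} + f_k(\{e'\}) + f_k(\mathcal{U})\big)^{-1} \mathbf{H}_k \mathbf{w}_e\Big| \le \log\Big|\mathbf{I} + \sum_{e \in \mathcal{V}} p_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger \big(\mathbf{I} + f_k(\mathcal{U})\big)^{-1} \mathbf{H}_k \mathbf{w}_e\Big|. \qquad (61)$$

Substituting (61) in (60) leads to (20). The remaining parts can be readily verified to be true. □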
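
The diminishing-returns property established above can be checked by brute force on a small example. The sketch below assumes the simplest setting, a single-antenna receiver, where each element e contributes a non-negative scalar gain (standing in for $p_e |\mathbf{H}_k \mathbf{w}_e|^2$) and the rate is $\log(1 + \text{sum of gains})$; the gain values are made up for illustration.

```python
import itertools
import math

def rate(U, gains):
    """Scalar rate of a single-antenna user: log(1 + sum of selected gains)."""
    return math.log(1.0 + sum(gains[e] for e in U))

def has_diminishing_returns(gains, tol=1e-12):
    """Check R(U + e) - R(U) >= R(U' + e) - R(U') for all U within U' and e outside U'."""
    elements = list(gains)
    subsets = [frozenset(c) for r in range(len(elements) + 1)
               for c in itertools.combinations(elements, r)]
    for small in subsets:
        for large in subsets:
            if not small <= large:
                continue
            for e in elements:
                if e in large:
                    continue
                gain_small = rate(small | {e}, gains) - rate(small, gains)
                gain_large = rate(large | {e}, gains) - rate(large, gains)
                if gain_small < gain_large - tol:
                    return False
    return True
```

The exhaustive check is exponential in the number of elements, so it is only meant for toy instances.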



Appendix C 
Proof of Proposition 3

To show the hardness of the discrete precoder design problem (22), consider an instance of the hitting
set problem, which is among Karp's 21 NP-complete problems [27]. Specifically, with a collection of $K'$
subsets $\{\mathcal{S}_k\}_{k=1}^{K'}$ of a ground set $\mathcal{S}$, and a positive integer $P'$, the goal is to find whether there exists
a hitting set $\mathcal{S}'$ of size $P'$ or less, that is, a subset $\mathcal{S}' \subseteq \mathcal{S}$ such that $|\mathcal{S}'| \le P'$ and $\mathcal{S}' \cap \mathcal{S}_k \ne \emptyset$,
$\forall k = 1, \ldots, K'$. For convenience, given any element $s \in \mathcal{S}$, let the indices of those subsets that s belongs
to form the set $\mathcal{K}(s) \subseteq \{1, \ldots, K'\}$, such that $s \in \mathcal{S}_k$, $\forall k \in \mathcal{K}(s)$, and $s \notin \mathcal{S}_k$, $\forall k \notin \mathcal{K}(s)$. Furthermore,
we restrict our attention to instances constrained to satisfy $K' = O(|\mathcal{S}|^{A})$ for any arbitrarily fixed positive
integer $A \ge 2$. We note that the hitting set problem remains NP hard even under such restriction [27].

To map this hitting set instance to one instance of (22), wlog let $K = K'$ and assume that each user
k is equipped with one antenna; i.e., $N_k = 1$, $\forall k$. Let the number of transmit antennas $M = K$, and set
the channel gain vectors $\{\mathbf{H}_k\}$ to be orthonormal such that $\mathbf{H}_k \mathbf{H}_j^\dagger = \delta_{k,j}$, where the latter represents the
Kronecker delta operator. For the discrete codebook $\mathcal{C}_d$, consider a flat power profile for each codeword
as $p_e = 1$, $\forall e \in \mathcal{E}$. Moreover, for any element $s \in \mathcal{S}$ and its companion set $\mathcal{K}(s)$, there exists an element
$g(s) = e \in \mathcal{E}$, such that the corresponding codeword has rank 1, in the form of

$$\mathbf{w}_e = \frac{1}{\sqrt{|\mathcal{K}(s)|}} \sum_{i \in \mathcal{K}(s)} \mathbf{H}_i^\dagger. \qquad (62)$$

Notice that any element s with $\mathcal{K}(s) = \emptyset$ can be included as a special case for this codeword setting,
which simply renders the corresponding $\mathbf{w}_e = \mathbf{0}$. Under this codeword definition, the achievable rate at
user k as a set function in (19) becomes

$$R_k(\mathcal{U}) = \log\Big( 1 + \sum_{e \in \mathcal{U}} \mathbf{H}_k \mathbf{w}_e \mathbf{w}_e^\dagger \mathbf{H}_k^\dagger \Big). \qquad (63)$$



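If there exists some $e' \in \mathcal{U}$ such that its corresponding $s' = g^{-1}(e') \in \mathcal{S}_k$, then it holds

$$R_k(\mathcal{U}) \ge \log\Bigg( 1 + \mathbf{H}_k \Big[ \frac{1}{\sqrt{|\mathcal{K}(s')|}} \sum_{i \in \mathcal{K}(s')} \mathbf{H}_i^\dagger \Big] \Big[ \frac{1}{\sqrt{|\mathcal{K}(s')|}} \sum_{i \in \mathcal{K}(s')} \mathbf{H}_i^\dagger \Big]^\dagger \mathbf{H}_k^\dagger \Bigg) = \log(1 + 1/|\mathcal{K}(s')|) \ge \log(1 + 1/K) \qquad (64)$$

where the last inequality comes from $|\mathcal{K}(s')| \le K$. Otherwise, if for any $e' \in \mathcal{U}$, the corresponding
$s' \notin \mathcal{S}_k$, then it can also be shown that $R_k(\mathcal{U}) = 0$. Therefore, for each set $\mathcal{S}_k$, the set function
$R_k(\mathcal{U}') \ge \log(1 + 1/K)$ if the subset $\mathcal{S}' = \{g^{-1}(e) : e \in \mathcal{U}'\} \subseteq \mathcal{S}$ corresponding to all codewords in $\mathcal{U}'$
intersects $\mathcal{S}_k$, and $0$ otherwise. If we assume an optimal solution to the hitting set instance is $\mathcal{S}^*$ of size
no greater than $P'$, then for the corresponding set $\mathcal{U}^*$ we have $\min_k R_k(\mathcal{U}^*) \ge \log(1 + 1/K)$. For any
other $\mathcal{S}'$ that is not a hitting set, we have the corresponding $\min_k R_k(\mathcal{U}') = 0$.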

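Now consider the precoder design problem (22) under the current settings. Due to the flat power profile,
$p_{\underline{\mathcal{U}}} \le P$ is equivalent to a cardinality constraint $|\underline{\mathcal{U}}| \le \lfloor P \rfloor$. To establish the connection to the hitting
set problem, let $\lfloor P \rfloor = P'$. If there were an algorithm for (22) with approximation guarantee $\gamma(|\underline{\mathcal{E}}|)$, it
would select a set $\underline{\mathcal{U}}'$ of size $|\underline{\mathcal{U}}'| \le P'$ with

$$\min_k R_k(\underline{\mathcal{U}}') \ge \gamma(|\underline{\mathcal{E}}|) \min_k R_k(\underline{\mathcal{U}}^*) \ge \gamma(|\underline{\mathcal{E}}|) \log(1 + 1/K) > 0. \qquad (65)$$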



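Since $\min_k R_k(\cdot)$ is either zero or at least $\log(1 + 1/K)$, this implies $\min_k R_k(\underline{\mathcal{U}}') \ge \log(1 + 1/K)$, and thus
the subset $\underline{\mathcal{S}}' \subseteq \underline{\mathcal{S}}$ corresponding to $\underline{\mathcal{U}}'$ would be a hitting set. Accordingly, this approximation algorithm
would be able to decide whether there exists a hitting set of size $P'$, contradicting the NP-hardness of the
hitting set problem [27]. □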
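
The reduction above can be checked on a small instance. The following is a minimal sketch, not the authors' code: it assumes the channels are the unit row vectors (one valid orthonormal choice with $M = K$), and the function names and the toy instance are hypothetical. It verifies that the minimum rate reaches $\log(1 + 1/K)$ exactly when the chosen elements form a hitting set, and is zero otherwise.

```python
import itertools
import math


def build_codebook(num_users, user_sets):
    """Map each ground element s to a rank-1 codeword W_e (a length-K list).

    Assumes M = K and orthonormal channels H_k = k-th unit row vector, so
    W_e = |K(s)|^{-1/2} * sum_{i in K(s)} H_i^dagger has entries
    1/sqrt(|K(s)|) on the companion users K(s) and zeros elsewhere.
    """
    ground = sorted(set().union(*user_sets))
    codebook = {}
    for s in ground:
        companions = {k for k in range(num_users) if s in user_sets[k]}
        scale = 1.0 / math.sqrt(len(companions))
        codebook[s] = [scale if k in companions else 0.0 for k in range(num_users)]
    return codebook


def min_rate(num_users, codebook, chosen):
    """min_k log(1 + sum_e |H_k W_e|^2) with flat power p_e = 1."""
    # With unit-vector channels, H_k W_e is simply entry k of W_e.
    return min(
        math.log(1.0 + sum(codebook[s][k] ** 2 for s in chosen))
        for k in range(num_users)
    )


def is_hitting_set(user_sets, chosen):
    """True if `chosen` intersects every user's set S_k."""
    return all(set(chosen) & s_k for s_k in user_sets)


# Hypothetical toy instance: K = 3 users, ground set {0, 1, 2, 3}.
user_sets = [{0, 1}, {1, 2}, {2, 3}]
K = len(user_sets)
codebook = build_codebook(K, user_sets)
threshold = math.log(1.0 + 1.0 / K)

for chosen in itertools.combinations(sorted(codebook), 2):
    rate = min_rate(K, codebook, chosen)
    print(chosen, round(rate, 4), is_hitting_set(user_sets, chosen))
```

Enumerating all pairs plays the role of the cardinality constraint $|\underline{\mathcal{U}}| \le P'$ with $P' = 2$; any algorithm with a positive approximation guarantee would have to return one of the pairs with non-zero minimum rate.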
Appendix D
Proof of Proposition 4

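Note that Algorithm 3 clearly converges; let $\hat{c}$ denote the value of $c$ obtained upon convergence and
$\hat{\underline{\mathcal{U}}}$ the set returned with it. Invoking Lemma 3 we can conclude that

$$\bar{R}_{\hat{c}}(\hat{\underline{\mathcal{U}}}) \ge \hat{c}(1 - \delta) \qquad (66)$$

with $p_{\hat{\underline{\mathcal{U}}}} \le P(1 + \ln(1/\delta))$. Further, since the value $\hat{c} + \epsilon$ cannot be achieved by Algorithm 2 without
exceeding the budget $P(1 + \ln(1/\delta))$, from Lemma 3, (26) and (25) we can also deduce that the power needed to
achieve $\hat{c} + \epsilon$ exceeds $P$, so that

$$\max_{\underline{\mathcal{U}}:\, p_{\underline{\mathcal{U}}} \le P} \min_k R_k(\underline{\mathcal{U}}) < \hat{c} + \epsilon. \qquad (67)$$

Next, expanding $\bar{R}_{\hat{c}}(\hat{\underline{\mathcal{U}}}) = (1/K) \sum_{k=1}^{K} \min\{R_k(\hat{\underline{\mathcal{U}}}), \hat{c}\}$ and using (66), we can show via contradiction
that we must have

$$R_k(\hat{\underline{\mathcal{U}}}) \ge \hat{c}(1 - K\delta), \quad \forall\, 1 \le k \le K. \qquad (68)$$

(67) and (68) together prove the proposition. □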
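
The contradiction step behind (68) can be written out explicitly. This is a sketch under assumed notation: $\hat{c}$ is the converged value of $c$, $\hat{U}$ the returned set, and $\bar{R}_{\hat{c}}$ the truncated average $(1/K)\sum_k \min\{R_k, \hat{c}\}$; the index $j$ is introduced here only for the argument.

```latex
% Suppose, for contradiction, that some user j violates (68):
%   R_j(\hat{U}) < \hat{c}\,(1 - K\delta).
% Each of the other K-1 terms is at most \hat{c}, and
% \min\{R_j(\hat{U}), \hat{c}\} \le R_j(\hat{U}), hence
\begin{align*}
\bar{R}_{\hat{c}}(\hat{U})
  &= \frac{1}{K}\sum_{k=1}^{K}\min\{R_k(\hat{U}),\hat{c}\} \\
  &\le \frac{1}{K}\Big[(K-1)\hat{c} + R_j(\hat{U})\Big] \\
  &< \frac{1}{K}\Big[(K-1)\hat{c} + \hat{c}\,(1-K\delta)\Big]
   = \hat{c}\,(1-\delta),
\end{align*}
% which contradicts (66).
```

The factor $K$ in (68) thus comes from a single user absorbing the entire shortfall $K \hat{c} \delta$ allowed by the average in (66).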



Appendix E 
Proof of Proposition 5

We first assume that for the given input channel set there exists a set of feasible precoding matrices
$\{W^{(\ell)}\}_{\ell=1}^{L}$ such that $\sum_{\ell=1}^{L} R_k(W^{(\ell)})/\Theta \ge \Delta$, for all users $1 \le k \le K$ and for some $\Delta > 0$. Note that
this assumption is not satisfied only if one or more users have mutually orthogonal input channels, i.e.,
$(H_k)^\dagger H_j = 0$ for some $k \ne j$. In that case users can be partitioned into multiple groups with each
group satisfying the aforementioned assumption, and the arguments given below can be used separately
over each group. Then, since $\Theta$ is finite, a feasible solution to ensure that each user decodes the common
message is to repeat $\{W^{(\ell)}\}_{\ell=1}^{L}$ over $\lceil 1/\Delta \rceil$ scheduling intervals, which then yields a finite value for the
objective function in (P2'), denoted henceforth by $G$. Letting $\{W_{\rm opt}^{(\tau)}\}$ be any optimal solution, we can
deduce that the optimal objective function value for (P2') yielded by it is clearly finite and no greater
than $G$. By contradiction, it can then be argued that for each user $k$, $\sum_{\tau=1}^{Lt} R_k(W_{\rm opt}^{(\tau)})/\Theta > 1 - \Delta$ for
all $t > \frac{G}{\Delta \min_k \mu_k}$. Then, since $\{W^{(\ell)}\}_{\ell=1}^{L}$ achieves a normalized rate no less than $\Delta$ for each user in a
scheduling interval, invoking the optimality of $\{W_{\rm opt}^{(\tau)}\}$ we must have that $\sum_{\tau=1}^{Lt} R_k(W_{\rm opt}^{(\tau)})/\Theta \ge 1$ for
all $t > \hat{t} = 1 + \frac{G}{\Delta \min_k \mu_k}$. Consequently, without loss of optimality the given optimal solution can be
truncated after $L\hat{t}$. □
Appendix F
Proof of Proposition 6

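Suppose $\tilde{t}$ is the value for the number of scheduling intervals returned upon termination of the while-do
loop and let $\{\hat{W}^\tau\}_{\tau=1}^{L\tilde{t}}$ denote the iterate returned by it. Then, using arguments similar to those made
to prove Proposition 1, it can be shown that $\{\hat{W}^\tau\}_{\tau=1}^{L\tilde{t}}$ is a stationary point of (P2'') (evaluated for that
$\tilde{t}$). Thus, $\{\hat{W}^\tau\}_{\tau=1}^{L\tilde{t}}$ must be feasible, i.e., $\hat{W}^\tau \in \mathcal{C}_c, \forall \tau$, and also satisfy the other KKT conditions
for (P2''). Let $A_k(\hat{W}^\tau), B(\hat{W}^\tau)$ denote the derivatives of $R_k(W^\tau)$ and $\|W^\tau\|_F^2$ (with respect to the
precoding matrix argument) evaluated at $\hat{W}^\tau$, respectively. Further, let $\hat{C}_k = \max\{t \in \{0, 1, \cdots, \tilde{t}\} :
\sum_{\tau=1}^{Lt} R_k(\hat{W}^\tau) < \Theta\}, \forall k$, where we note that $\hat{C}_k = 0$ if $\sum_{\tau=1}^{L} R_k(\hat{W}^\tau) \ge \Theta$. Then invoking the KKT
conditions for (P2''), after some manipulations we can deduce that there must exist non-negative scalars
$\delta^\tau, 1 \le \tau \le L\tilde{t}$ such that

$$\sum_{k} A_k(\hat{W}^\tau)\, \mu_k \left(\hat{C}_k - \lceil \tau/L \rceil + 1\right) = \delta^\tau B(\hat{W}^\tau), \quad 1 \le \tau \le L\tilde{t}. \qquad (69a)$$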

Clearly using this $\{\hat{W}^\tau\}_{\tau=1}^{L\tilde{t}}$, the accumulated rate of each user $k$ after $\tilde{t}$ scheduling intervals is no less
than $\Theta - \Delta\Theta$. We only consider the case where at least one user's accumulated rate is less than $\Theta$ since
the remaining one can be proved in a similar manner. Then, letting $T = \max\{\tilde{t}, \hat{t} + 1\}$, where we recall
$\hat{t}$ was implicitly defined in Proposition 5, we consider the KKT conditions for the following problem

$$\min_{\{W^\tau \in \mathcal{C}_c\}} \sum_{k=1}^{K} \sum_{t=1}^{T} \cdots \qquad (70)$$



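Note that any optimal solution of (P2') (truncated without loss of optimality after interval $T$) must satisfy
the KKT conditions for (70). Now consider the augmented set $\{\tilde{W}^\tau\}_{\tau=1}^{LT}$, where

$$\tilde{W}^\tau = \begin{cases} \hat{W}^\tau, & \text{if } 1 \le \tau \le L\tilde{t} \\ W^{(\tau - L\tilde{t})}, & \text{else if } L\tilde{t} + 1 \le \tau \le L(1 + \tilde{t}) \\ \mathbf{0}, & \text{otherwise.} \end{cases}$$

Letting $\tilde{C}_k = \max\{t \in \{0, 1, \cdots, T\} : \sum_{\tau=1}^{Lt} R_k(\tilde{W}^\tau) < \Theta\}, \forall k$, a key observation is that $\tilde{C}_k = \hat{C}_k \le
\tilde{t}, \forall k$. This fact along with (69a) allows us to conclude that

$$\sum_{k} A_k(\tilde{W}^\tau)\, \mu_k \left(\tilde{C}_k - \lceil \tau/L \rceil + 1\right) = \delta^\tau B(\tilde{W}^\tau), \quad 1 \le \tau \le LT, \qquad (71a)$$
which suffices to satisfy the KKT conditions for (70) and hence those for (P2'). □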
Appendix G
Proof of Proposition 7

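The monotonicity of $f(\cdot)$ can be readily verified. Consider the set function $\tilde{g}_k$, for any
$k \in \mathcal{K}$, defined over subsets $\underline{\mathcal{V}}$ of the ground set as

$$\tilde{g}_k(\underline{\mathcal{V}}) = \log\Big| \mathbf{I} + \sum_{(e,\ell) \in \underline{\mathcal{V}}} p_e H_k W_e W_e^\dagger H_k^\dagger \Big| \qquad (72)$$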

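which from Proposition 2 can be deduced to be a submodular set function over the ground set. From this fact, it
follows that the functions $g_{k,\ell}(\underline{\mathcal{V}}) = \tilde{g}_k(\underline{\mathcal{V}} \cap \underline{\mathcal{E}}^\ell)$, for every subset $\underline{\mathcal{V}}$ of the ground set and for $1 \le \ell \le L$, are all submodular set
functions, where

$$\underline{\mathcal{E}}^\ell := \{(e, \ell) \mid e \in \underline{\mathcal{E}}\}, \qquad (73)$$

so that $\{\underline{\mathcal{E}}^\ell\}_{\ell=1}^{L}$ form a partition of the ground set. Consequently, the set function

$$g_k(\underline{\mathcal{V}}) = \sum_{\ell=1}^{L} \frac{1}{\Theta(1 - \cdots)}\, g_{k,\ell}(\underline{\mathcal{V}}), \qquad (74)$$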

being a linear combination of submodular set functions in which the combining coefficients are all positive
constants, is a submodular set function over the ground set. Next, since truncation of a monotone submodular
function preserves submodularity, we can conclude that $f_k(\underline{\mathcal{V}}) = \min\{g_k(\underline{\mathcal{V}}), 1\}$, defined for every subset
$\underline{\mathcal{V}}$ of the ground set, is a submodular set function. Finally, we can expand $f(\cdot)$ in (39) as

$$f(\underline{\mathcal{V}}) = \sum_{k \in \mathcal{K}} \mu_k f_k(\underline{\mathcal{V}}), \qquad (75)$$

which again being a linear combination of submodular set functions (with positive and constant combining
weights) is thus a submodular set function over the ground set. □
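
The chain of closure properties used in this proof (restriction to a partition block, positive linear combination, truncation at 1, positive linear combination over users) can be checked by brute force on a toy instance. The following is a minimal sketch under stated assumptions: single-antenna users, so that the log-determinant reduces to log(1 + sum of gains); the values in `gains`, `coef`, and `mu` are hypothetical, with `coef` standing in for the positive constants in (74).

```python
import itertools
import math


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check f(A+x) - f(A) >= f(B+x) - f(B) for all A <= B, x not in B."""
    ground = frozenset(ground)
    for big in subsets(ground):
        for small in subsets(big):
            for x in ground - big:
                gain_small = f(small | {x}) - f(small)
                gain_big = f(big | {x}) - f(big)
                if gain_small < gain_big - tol:
                    return False
    return True


def make_f(gains, coef, mu):
    """f(V) = sum_k mu_k * min{ sum_l coef[l] * g_{k,l}(V), 1 }.

    gains[k][e] plays the role of p_e |H_k W_e|^2 (hypothetical numbers);
    elements of V are pairs (e, l): codeword e used in interval l.
    """
    def f(chosen):
        total = 0.0
        for k, row in enumerate(gains):
            g_k = sum(
                coef[l] * math.log(1.0 + sum(row[e] for (e, i) in chosen if i == l))
                for l in range(len(coef))
            )
            total += mu[k] * min(g_k, 1.0)  # truncation at 1
        return total
    return f


gains = [[0.9, 0.2, 1.5], [0.1, 2.0, 0.4]]  # 2 users, 3 codewords
coef = [1.0, 0.7]                           # L = 2 intervals
mu = [1.0, 2.5]                             # user weights
ground = [(e, l) for e in range(3) for l in range(2)]

f = make_f(gains, coef, mu)
print(is_submodular(f, ground))
print(is_submodular(lambda chosen: len(chosen) ** 2, ground))
```

The second call uses $|\underline{\mathcal{V}}|^2$, whose marginal gains increase with the set size, as a sanity check that the test can fail.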



Appendix H 
Proof of Propositions 8 and 9

We will first prove Proposition 8. Here, the fact that each $R_k^t(\cdot)$, $1 \le k \le K$, $\forall t$, is a monotonic set
function over $\underline{\mathcal{E}}$ suffices to assert that the problem in (P2D) can be further constrained without loss of
optimality by enforcing that each precoder used must lie in the set $\mathcal{C}_d$. Further, without loss of generality,
we can assume that each codeword in the given set $\{W^{(\ell)}\}_{\ell=1}^{L}$ is maximal since otherwise the set can
always be arbitrarily expanded. Suppose now that an optimal solution involves employing an identical
set of maximal precoding matrix codewords, $\{\tilde{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$, for more than $Q = \lceil \Theta / \min_k \Delta_k \rceil$ scheduling
intervals. Consider the first $Q$ scheduling intervals over which the set $\{\tilde{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$ is used. Note
that upon using that set over $Q$ scheduling intervals, each user $k$ for whom the accumulated rate is less
than $\Theta$ must satisfy $\sum_{\ell=1}^{L} R_k(\tilde{W}^{(\ell)}) < \min_j \Delta_j$. As a result, all further uses of the set $\{\tilde{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$
can be replaced without loss of optimality by those of the set $\{W^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$, since in any scheduling
interval the latter set can simultaneously achieve a larger rate than $\{\tilde{W}^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$ for each remaining
user. Finally, no more than $Q$ uses of the set $\{W^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$ are needed to ensure an accumulated rate
of at least $\Theta$ for each user.

We now prove Proposition 9. Towards this end, let us construct a matrix $R$ having $K$ rows, one for
each user. To build the columns of $R$, enumerate all possible sets of maximal codewords $\{W^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$
and repeat each set $\lceil \Theta / \min_k \Delta_k \rceil$ times. Next, add a column in $R$ for each such set, where the column contains
the rates (in a scheduling interval) achieved upon using that set for all $K$ users. Clearly, then the sum
of each row of $R$ is at least $\Theta$. Further, after this reformulation, upon invoking Proposition 8 we can
deduce that the problem (P2D) is in fact equivalent to finding a permutation of the columns of the matrix
$R$ that minimizes the weighted sum cover time over the rows, where the cover time of each row is the
smallest column index for which the partial sum on that row is at least $\Theta$. The latter problem is an
instance of the ranking with additive valuations problem considered in [17]. It has been shown in [17]
that solving a linear program (LP) followed by a randomized rounding procedure can give rise to a
column permutation solution which achieves a weighted sum cover time no greater than a constant times
the optimal one. However, the number of constraints in the pertinent LP here grows exponentially with
the number of columns in $R$, which requires additional processing to avoid exponential complexity and
still renders this method prohibitively complex. Another deterministic algorithm with a weaker guarantee
has also been proposed for the ranking problem [18]. However, a direct adaptation of this algorithm to
(P2D) will yield Algorithm 5, albeit where the maximization in (37a) must be optimally solved over
all sets $\{W^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$ of maximal codewords. The latter optimization problem is hard to solve (indeed it is NP-hard), which can
dramatically increase the complexity due to the potentially large cardinality $|\mathcal{C}_d|^L$. The key modification
introduced in Algorithm 5 is to sub-optimally and efficiently solve (37a) over the larger collection of all sets $\{W^{(\ell)} \in \mathcal{C}_d\}_{\ell=1}^{L}$
instead, after recognizing it to be a submodular maximization problem subject to knapsack constraints.
Then, since (37a) is approximately solved with a constant-factor approximation guarantee by Algorithm
6, which we note also returns a set of maximal codewords, a careful verification of the proof in [18]
reveals that Algorithm 5 retains the $O(\ln(1/\epsilon))$ performance guarantee of the direct adaptation. Indeed,
the effect of sub-optimally solving (37a) is that the constant $\Gamma$ in Proposition 9 is larger (by a factor
0(l/(2L)i/'^)) compared to the case when (37a) is optimally solved.
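
The column-permutation objective described in this proof can be made concrete. The following is a minimal sketch, not Algorithm 5 itself: it only evaluates the weighted sum cover time of a given column order and finds the best order by exhaustive search, which is viable for tiny matrices only. The function names and the toy matrix are hypothetical.

```python
import itertools


def weighted_sum_cover_time(columns, order, weights, theta):
    """Weighted sum over rows (users) of the first position at which the
    running row sum reaches theta, when columns are played in `order`."""
    num_rows = len(weights)
    partial = [0.0] * num_rows
    cover = [None] * num_rows
    for position, j in enumerate(order, start=1):
        for k in range(num_rows):
            partial[k] += columns[j][k]
            if cover[k] is None and partial[k] >= theta:
                cover[k] = position
    if any(c is None for c in cover):
        raise ValueError("some row never reaches theta under this order")
    return sum(w * c for w, c in zip(weights, cover))


def brute_force_best_order(columns, weights, theta):
    """Exhaustive search over permutations; only viable for tiny instances."""
    return min(
        itertools.permutations(range(len(columns))),
        key=lambda order: weighted_sum_cover_time(columns, order, weights, theta),
    )


# Hypothetical toy matrix R, stored column by column (2 users, 4 columns).
columns = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.5, 0.5)]
weights = [3.0, 1.0]
theta = 1.0

best = brute_force_best_order(columns, weights, theta)
print(best, weighted_sum_cover_time(columns, best, weights, theta))
```

In this toy instance the heavily weighted first user is served first in the best order; the number of permutations grows factorially with the number of columns, which is why the proof turns to approximation algorithms for the ranking problem.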
Fig. 1. The maximum achievable rates, with (a) M = 2 and (b) M = 4 transmitting antennas and N = 1 receive antenna,
versus number of users K for different schemes (P = 10 and d = 2).
Fig. 2. The maximum achievable rates, with (a) M = 2 and (b) M = 4 transmitting antennas and N = 2 receive antennas,
versus number of users K for different schemes (P = 10 and d = 2).
Fig. 3. The weighted sum delay with M = 4 transmit antennas and (a) N = 1 receive antenna with equal user weights,
(b) N = 2 receive antennas with equal user weights, and (c) N = 2 receive antennas with unequal user weights, versus the
transmit power for different schemes.
Fig. 4. The maximum achievable rates with M = 4 transmit antennas and N = 2 receive antennas, versus transmit SNR for
different schemes.



